All of Your Network Monitoring is (probably) Wrong
![Page 1: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/1.jpg)
All of Your Network Monitoring is (probably) Wrong
joe damato packagecloud.io
![Page 2: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/2.jpg)
greetings
![Page 4: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/4.jpg)
packagecloud.io
@packagecloudio
![Page 5: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/5.jpg)
follow along
blog.packagecloud.io
![Page 6: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/6.jpg)
cognitive load
![Page 7: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/7.jpg)
too much stuff
![Page 8: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/8.jpg)
cognitive load
copy & paste configs
![Page 9: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/9.jpg)
BTW
This is actually part of another talk I’m working on called
Programmers should get paid more & work less
![Page 10: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/10.jpg)
anw
![Page 11: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/11.jpg)
cognitive load
ever copy/paste conf or tuning settings you didn’t understand?
![Page 12: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/12.jpg)
(probably)
![Page 13: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/13.jpg)
there’s too much damn code
![Page 14: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/14.jpg)
similarly...
![Page 15: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/15.jpg)
cognitive load
do you really understand every single graph you are generating?
![Page 16: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/16.jpg)
(probably not)
![Page 17: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/17.jpg)
there’s too much damn code
![Page 18: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/18.jpg)
If there’s too much damn code to configure and tune
![Page 19: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/19.jpg)
what makes you think you can actually
monitor it?
![Page 20: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/20.jpg)
spoiler: you can’t
![Page 21: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/21.jpg)
(prob. doesn't matter, more on this later)
![Page 22: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/22.jpg)
claim: the more complex the system is, the harder it is to monitor
![Page 23: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/23.jpg)
NOTE
complexity != bad
![Page 24: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/24.jpg)
from: https://www.flickr.com/photos/49128298@N04/24290342544/
![Page 25: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/25.jpg)
NOTE
• you want to:
• download and play the cat game on your an Phone (cause small portable electronic devices)
• while messaging your buds (cause ur lonely)
• with in app purchases (cause you need more gold fish)
• over a TLS encrypted connection (cause payments)
• over a VPN (cause china)
• while flying (cause boredom)
![Page 26: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/26.jpg)
and that’s OK
it doesn’t mean complexity is bad
![Page 27: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/27.jpg)
2 complicated things that aren’t necessarily bad
![Page 30: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/30.jpg)
so, like, you know one thing that’s p. complicated?
![Page 31: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/31.jpg)
the Linux networking stack
![Page 32: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/32.jpg)
all kiiiiiiiiiiiiiiiiiiiiiiiinda features
• different NICs have different rx and tx queue size limits and defaults
• ethernet bonding
• IRQ modulation, ntuple filtering, ….
• RSS, RPS, RFS, aRFS
• GRO, GSO, hw accelerated VLAN IDs, timestamping, ….
• you are probably using at least 2 protocol stacks (IP and TCP/UDP)
• all kiiiiiiiiiiiiiiinda tuning levers and knobs for everything from top to bottom
![Page 33: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/33.jpg)
all kiiiiiinda bugs
![Page 34: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/34.jpg)
and
![Page 35: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/35.jpg)
basically
![Page 36: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/36.jpg)
literally
![Page 37: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/37.jpg)
actually
![Page 38: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/38.jpg)
really
![Page 39: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/39.jpg)
no docs
![Page 40: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/40.jpg)
https://www.redhat.com/archives/rhl-list/2007-September/msg03735.html
> How can I find out the /proc/net info
>
> eg: softnet_stat is for what purpose !
Much of this is only well-documented in the code. Here's an attempt at interpreting softnet_stat [no guarantee that it is correct; read the code!]:
![Page 41: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/41.jpg)
total # of packets (not including netpoll) received by the interrupt handler. There might be some double counting going on [ … ] I think the intention was that these were originally on separate receive paths
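A minimal sketch (not from the talk) of reading that file. It assumes the usual layout of /proc/net/softnet_stat: one line per CPU, whitespace-separated hex columns. Only the first column gets a name, following the interpretation quoted above; the rest stay raw because their meaning depends on the kernel version. `parse_softnet_stat` and the sample numbers are made up for illustration.

```python
def parse_softnet_stat(text):
    """Parse softnet_stat-style text: one row per line, hex columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        values = [int(field, 16) for field in fields]
        # First column: packets processed (per the quoted interpretation).
        # Remaining columns are left unnamed on purpose: read the code
        # for your kernel before trusting any label.
        rows.append({"processed": values[0], "raw": values})
    return rows


# Made-up sample input, two CPUs.
SAMPLE = (
    "0000a1b2 00000000 00000003 00000000\n"
    "000000ff 00000001 00000000 00000000\n"
)

rows = parse_softnet_stat(SAMPLE)
for index, row in enumerate(rows):
    print(index, row["processed"], row["raw"][1:])
```

On a real box you would pass in the contents of /proc/net/softnet_stat instead of `SAMPLE`.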
![Page 42: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/42.jpg)
Full networking writeup
literally 90 pages
literally everything about linux networking
literally available here: http://bit.ly/linux-networking
![Page 43: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/43.jpg)
it’s fine, as long as we are honest that it’s just reality
![Page 44: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/44.jpg)
![Page 45: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/45.jpg)
[random os] has a better/faster/leaner/whatever networking stack
than linux
![Page 46: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/46.jpg)
![Page 47: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/47.jpg)
anw
![Page 48: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/48.jpg)
complex
• not necessarily inefficient
• not necessarily bad
• people expect a lot of complicated features
• so there’s a lot of code needed to support all this random stuff you want to do
• see also: cat game example
![Page 49: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/49.jpg)
complex
• the bad news…..
• you are supposed to monitor this complicated code
• and then you are supposed to look at some graphs
• and then you are supposed to Know The Answer™
![Page 50: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/50.jpg)
sounds difficult… but it gets better ;)
![Page 51: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/51.jpg)
what if i told you…
![Page 52: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/52.jpg)
a driver bug caused stats to output incorrectly in
/proc/net/dev?
![Page 53: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/53.jpg)
igb
• driver stats updated via a timer every 2 secs
• reading stats via /proc/net/dev produced stale stats
• but not via ethtool (different code path)
• fixed by forcing stat update whenever stats are read
• i saw this in production. did you????
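A rough sketch (not from the talk) of what a collector reading /proc/net/dev does, assuming the usual layout: two header lines, then one `iface: 16 counters` line per interface. `parse_proc_net_dev`, `counter_delta`, and the sample numbers are made up for illustration. The point: if two samples taken less than 2 seconds apart show a zero delta, you can't tell "no traffic" from "the driver hasn't refreshed its counters yet".

```python
RX_COLUMNS = ["bytes", "packets", "errs", "drop",
              "fifo", "frame", "compressed", "multicast"]
TX_COLUMNS = ["bytes", "packets", "errs", "drop",
              "fifo", "colls", "carrier", "compressed"]


def parse_proc_net_dev(text):
    """Parse /proc/net/dev-style text into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        values = [int(v) for v in rest.split()]
        stats[name.strip()] = {
            "rx": dict(zip(RX_COLUMNS, values[:8])),
            "tx": dict(zip(TX_COLUMNS, values[8:16])),
        }
    return stats


def counter_delta(prev, cur, iface):
    """Byte counter change between two parsed samples for one interface."""
    return {
        "rx_bytes": cur[iface]["rx"]["bytes"] - prev[iface]["rx"]["bytes"],
        "tx_bytes": cur[iface]["tx"]["bytes"] - prev[iface]["tx"]["bytes"],
    }


HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)
# Made-up samples standing in for two reads of /proc/net/dev.
SAMPLE_T0 = HEADER + "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
SAMPLE_T1 = HEADER + "  eth0: 1500 15 0 0 0 0 0 0 2600 26 0 0 0 0 0 0\n"

before = parse_proc_net_dev(SAMPLE_T0)
after = parse_proc_net_dev(SAMPLE_T1)
delta = counter_delta(before, after, "eth0")
print(delta)
```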
![Page 54: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/54.jpg)
only matters if you are monitoring your network stats more often than every 2 sec.
![Page 55: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/55.jpg)
maybe you aren’t
because you don’t care
(that’s fine and you are prob. right)
![Page 56: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/56.jpg)
but if you do care…
![Page 57: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/57.jpg)
your future status
• you’d need to:
• notice the problem in your graph
• start reading your stats collecting code/plugin
• realize the bug is not there
• read your driver code
• realize the bug is in the code path that /proc/net/dev hits
• write a patch to fix it
• rebuild the driver and deploy it everywhere
![Page 58: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/58.jpg)
that’s a lot of work
just to monitor bytes tx/rx
greetings!
![Page 59: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/59.jpg)
things that don't exist
• descending order of probability it doesn't exist:
• free open source
• the an singularity
• calorie free chocolate covered bacon
• etc
![Page 60: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/60.jpg)
![Page 61: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/61.jpg)
but joe, my devops are the literal strongest and they doesn't afraid of
anything
![Page 62: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/62.jpg)
i’ll just use the ethtool
![Page 63: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/63.jpg)
ethtool
• a command line tool
• uses ioctl system call to talk to network drivers
• not all drivers actually implement the interface
• and the ones which do, generally, do it differently
![Page 64: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/64.jpg)
what if i told you…
![Page 65: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/65.jpg)
ethtool
• no standardized way of outputting driver stats
• some drivers don’t even implement the interface
• the ones which do use diff field names
![Page 66: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/66.jpg)
ok, let's compare!
• ec2 vif driver
• ixgbe driver
• igb driver
![Page 67: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/67.jpg)
ec2 vif driver
$ sudo ethtool -S eth0
NIC statistics:
     rx_gso_checksum_fixup: 0
ethtool outputs 1 statistic
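A small sketch (not from the talk) of turning `ethtool -S` text into a dict so you can count and compare stats across drivers. It assumes the `name: value` line format shown above with integer values; `parse_ethtool_stats` is a made-up helper name.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into {stat_name: int}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep or not value.strip():
            continue  # skips the "NIC statistics:" header and blank lines
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            continue  # non-numeric value; ignored in this sketch
    return stats


# The ec2 vif output from the slide above.
EC2_OUTPUT = """NIC statistics:
     rx_gso_checksum_fixup: 0
"""

ec2_stats = parse_ethtool_stats(EC2_OUTPUT)
print(len(ec2_stats), ec2_stats)
```

In practice the text would come from running `ethtool -S <iface>` on each host.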
![Page 68: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/68.jpg)
what even is rx_gso_checksum_fixup?
![Page 69: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/69.jpg)
joe’s ixgbe driver on an Real Computer
ethtool outputs 377 statistics
![Page 70: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/70.jpg)
NIC statistics: rx_packets: 9665600259 tx_packets: 12198470686 rx_bytes: 6790400019470 tx_bytes: 2169046666156 rx_pkts_nic: 11107310349 tx_pkts_nic: 12198470686 rx_bytes_nic: 6929982126806 tx_bytes_nic: 2217848697965 lsc_int: 1 tx_busy: 0 non_eop_descs: 1044042523 rx_errors: 0 tx_errors: 0 rx_dropped: 1 tx_dropped: 0 multicast: 7876979 broadcast: 2633 rx_no_buffer_count: 0 collisions: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 hw_rsc_aggregated: 6600573569 hw_rsc_flushed: 5158863479 fdir_match: 175127 fdir_miss: 11098854004 fdir_overflow: 1 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_timeout_count: 0 tx_restart_queue: 0 rx_long_length_errors: 0 rx_short_length_errors: 0
tx_flow_control_xon: 0 rx_flow_control_xon: 0 tx_flow_control_xoff: 0 rx_flow_control_xoff: 0 rx_csum_offload_errors: 0 alloc_rx_page_failed: 0 alloc_rx_buff_failed: 0 rx_no_dma_resources: 0 os2bmc_rx_by_bmc: 0 os2bmc_tx_by_bmc: 0 os2bmc_tx_by_host: 0 os2bmc_rx_by_host: 0 fcoe_bad_fccrc: 0 rx_fcoe_dropped: 0 rx_fcoe_packets: 0 rx_fcoe_dwords: 0 fcoe_noddp: 0 fcoe_noddp_ext_buff: 0 tx_fcoe_packets: 0 tx_fcoe_dwords: 0 tx_queue_0_packets: 650250933 tx_queue_0_bytes: 109734794973 tx_queue_1_packets: 734133738 tx_queue_1_bytes: 123318917069 tx_queue_2_packets: 772808083 tx_queue_2_bytes: 131183014063 tx_queue_3_packets: 741428236 tx_queue_3_bytes: 125821603228 tx_queue_4_packets: 692281561 tx_queue_4_bytes: 118278086880 tx_queue_5_packets: 783438226 tx_queue_5_bytes: 133234307795 tx_queue_6_packets: 719335931 tx_queue_6_bytes: 123662184314 tx_queue_7_packets: 668577198 tx_queue_7_bytes: 114915688397
tx_queue_8_packets: 711699909 tx_queue_8_bytes: 122460443627 tx_queue_9_packets: 681741781 tx_queue_9_bytes: 118032356999 tx_queue_10_packets: 585639061 tx_queue_10_bytes: 98009207733 tx_queue_11_packets: 640487443 tx_queue_11_bytes: 107781535416 tx_queue_12_packets: 706304786 tx_queue_12_bytes: 118963058912 tx_queue_13_packets: 716825472 tx_queue_13_bytes: 121032769231 tx_queue_14_packets: 699280537 tx_queue_14_bytes: 118119557225 tx_queue_15_packets: 675274048 tx_queue_15_bytes: 114916452394 tx_queue_16_packets: 123509474 tx_queue_16_bytes: 25473914817 tx_queue_17_packets: 101309066 tx_queue_17_bytes: 23513562050 tx_queue_18_packets: 92291301 tx_queue_18_bytes: 21830243983 tx_queue_19_packets: 87287348 tx_queue_19_bytes: 20887753665 tx_queue_20_packets: 34518707 tx_queue_20_bytes: 9837323388 tx_queue_21_packets: 24009284 tx_queue_21_bytes: 6760172375 tx_queue_22_packets: 23628875 tx_queue_22_bytes: 6707751077 tx_queue_23_packets: 25969617 tx_queue_23_bytes: 7343742932 tx_queue_24_packets: 30112206 tx_queue_24_bytes: 8614816667 tx_queue_25_packets: 28812367 tx_queue_25_bytes: 8186825345 tx_queue_26_packets: 31710307 tx_queue_26_bytes: 9139202059
tx_queue_27_packets: 40835241 tx_queue_27_bytes: 11499713701 tx_queue_28_packets: 39265877 tx_queue_28_bytes: 11045989548 tx_queue_29_packets: 41775414 tx_queue_29_bytes: 11804871879 tx_queue_30_packets: 12497615 tx_queue_30_bytes: 3405490173 tx_queue_31_packets: 11021513 tx_queue_31_bytes: 2659215149 tx_queue_32_packets: 10464342 tx_queue_32_bytes: 2632864135 tx_queue_33_packets: 11341007 tx_queue_33_bytes: 2818638887 tx_queue_34_packets: 12782059 tx_queue_34_bytes: 3307226594 tx_queue_35_packets: 12795212 tx_queue_35_bytes: 3400547658 tx_queue_36_packets: 59272452 tx_queue_36_bytes: 17286517363 tx_queue_37_packets: 85631445 tx_queue_37_bytes: 25126772743 tx_queue_38_packets: 84708817 tx_queue_38_bytes: 24920451495 tx_queue_39_packets: 83763431 tx_queue_39_bytes: 24662854523 tx_queue_40_packets: 0 tx_queue_40_bytes: 0 tx_queue_41_packets: 0 tx_queue_41_bytes: 0 tx_queue_42_packets: 0 tx_queue_42_bytes: 0 tx_queue_43_packets: 0 tx_queue_43_bytes: 0 tx_queue_44_packets: 0 tx_queue_44_bytes: 0 tx_queue_45_packets: 0 tx_queue_45_bytes: 0 tx_queue_46_packets: 0 tx_queue_46_bytes: 0 tx_queue_47_packets: 0 tx_queue_47_bytes: 0 tx_queue_48_packets: 0 tx_queue_48_bytes: 0 tx_queue_49_packets: 0 tx_queue_49_bytes: 0 tx_queue_50_packets: 0 tx_queue_50_bytes: 0 tx_queue_51_packets: 0 tx_queue_51_bytes: 0
tx_queue_52_packets: 0 tx_queue_52_bytes: 0 tx_queue_53_packets: 0 tx_queue_53_bytes: 0 tx_queue_54_packets: 0 tx_queue_54_bytes: 0 tx_queue_55_packets: 0 tx_queue_55_bytes: 0 tx_queue_56_packets: 0 tx_queue_56_bytes: 0 tx_queue_57_packets: 0 tx_queue_57_bytes: 0 tx_queue_58_packets: 0 tx_queue_58_bytes: 0 tx_queue_59_packets: 0 tx_queue_59_bytes: 0 tx_queue_60_packets: 0 tx_queue_60_bytes: 0 tx_queue_61_packets: 0 tx_queue_61_bytes: 0 tx_queue_62_packets: 0 tx_queue_62_bytes: 0 tx_queue_63_packets: 0 tx_queue_63_bytes: 0 tx_queue_64_packets: 0 tx_queue_64_bytes: 0 tx_queue_65_packets: 0 tx_queue_65_bytes: 0 tx_queue_66_packets: 0 tx_queue_66_bytes: 0 tx_queue_67_packets: 0 tx_queue_67_bytes: 0 tx_queue_68_packets: 0 tx_queue_68_bytes: 0 tx_queue_69_packets: 0 tx_queue_69_bytes: 0 tx_queue_70_packets: 0 tx_queue_70_bytes: 0 tx_queue_71_packets: 0 tx_queue_71_bytes: 0 rx_queue_0_packets: 677531848 rx_queue_0_bytes: 468871724028 rx_queue_1_packets: 756010412 rx_queue_1_bytes: 552322849015 rx_queue_2_packets: 790165770 rx_queue_2_bytes: 598765367940 rx_queue_3_packets: 759308572 rx_queue_3_bytes: 563185206581 rx_queue_4_packets: 716336754 rx_queue_4_bytes: 527389980455 rx_queue_5_packets: 816622000
![Page 71: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/71.jpg)
of those 377….
none of them are: rx_gso_checksum_fixup
![Page 72: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/72.jpg)
joe’s igb driver on an Real Computer
ethtool outputs 112 statistics
![Page 73: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/73.jpg)
similarly, of those 112….
none of them are: rx_gso_checksum_fixup
![Page 74: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/74.jpg)
surely 2 intel drivers
will have similar stats
![Page 75: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/75.jpg)
ixgbe diff igb =>
316 diff stats
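One plausible way to get a number like that (an assumption; the talk doesn't say how the diff was computed) is to compare the sets of stat names each driver reports. In this sketch the ixgbe names come from the output above, while the igb list is a made-up placeholder, not real igb output.

```python
def diff_stat_names(names_a, names_b):
    """Compare two collections of stat names."""
    a, b = set(names_a), set(names_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Names taken from the ixgbe output above (a small subset).
ixgbe_names = ["rx_packets", "tx_packets", "fdir_match", "hw_rsc_aggregated"]
# Hypothetical placeholder list standing in for another driver's stats.
other_names = ["rx_packets", "tx_packets", "hypothetical_other_driver_stat"]

result = diff_stat_names(ixgbe_names, other_names)
differing = len(result["only_a"]) + len(result["only_b"])
print(differing, result)
```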
![Page 76: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/76.jpg)
and it gets better!
![Page 77: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/77.jpg)
some measured in driver
some measured in hw
![Page 78: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/78.jpg)
monitor all the things!!11!!1!
![Page 79: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/79.jpg)
non_eop_descs ???
os2bmc_rx_by_bmc ???
rx_no_dma_resources ???
![Page 80: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/80.jpg)
![Page 81: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/81.jpg)
this is fine
i’ll read the driver source
i’m really good at the kernels
![Page 82: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/82.jpg)
this is fine
case ixgbe_mac_82599EB:
    for (i = 0; i < 16; i++)
        adapter->hw_rx_no_dma_resources +=
            IXGBE_READ_REG(hw, IXGBE_QPRDC(i));
![Page 83: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/83.jpg)
IXGBE_QPRDC
uh, wat?
![Page 84: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/84.jpg)
similar?
![Page 85: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/85.jpg)
register read
• this driver, like many others, gets some stats from the NIC
• it does this by reading register values
![Page 86: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/86.jpg)
documentation?
so we should be able to find this in the NIC data sheet………. right?
![Page 87: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/87.jpg)
![Page 88: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/88.jpg)
page 689
![Page 89: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/89.jpg)
success
• so just repeat this process:
• get the driver code
• read it for every stat
• figure out if stat is in software or hardware
• if it's in software read the driver and figure out what it means
• if it's in hardware find the data sheet and figure out what it means
• then graph it
• and then figure out what the graph means
![Page 90: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/90.jpg)
greetings
![Page 91: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/91.jpg)
what if i told you…
![Page 92: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/92.jpg)
some of these stats aren’t documented in the data
sheet?
![Page 93: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/93.jpg)
so, like, there's nothing you can do except literally
guess.
![Page 94: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/94.jpg)
you could email the device manufacturer….
![Page 95: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/95.jpg)
![Page 96: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/96.jpg)
![Page 97: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/97.jpg)
no one cares, joe
• no one cares about NIC level stats
• too low level
• /proc/net/dev works on my computer for tx/rx
• and it has high level summaries
• errors! drops! fifo! frame! compressed!
![Page 98: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/98.jpg)
but what do
errors! drops! fifo! frame! compressed!
mean?
![Page 99: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/99.jpg)
/proc/net/dev
![Page 100: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/100.jpg)
OK, so
• all these fields come from the driver
• some are from software, others from the NIC
• some fields are sums of the other fields
• this reduces your data sheet search space
• just search for the fields you care about
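To make "some fields are sums of the other fields" concrete, here is a sketch of how the rx drop, fifo, and frame columns are built from the driver-supplied counters. This reflects my reading of the kernel code that prints /proc/net/dev and is an assumption: check the source for the kernel version you actually run. `proc_net_dev_rx_columns` and the sample numbers are made up.

```python
def proc_net_dev_rx_columns(s):
    """Derive three /proc/net/dev rx columns from driver-level counters.

    Assumed mapping (verify against your kernel source):
      drop  = rx_dropped + rx_missed_errors
      fifo  = rx_fifo_errors
      frame = rx_length_errors + rx_over_errors
              + rx_crc_errors + rx_frame_errors
    """
    return {
        "drop": s["rx_dropped"] + s["rx_missed_errors"],
        "fifo": s["rx_fifo_errors"],
        "frame": (s["rx_length_errors"] + s["rx_over_errors"]
                  + s["rx_crc_errors"] + s["rx_frame_errors"]),
    }


# Made-up driver counters for illustration.
driver_counters = {
    "rx_dropped": 1,
    "rx_missed_errors": 2,
    "rx_fifo_errors": 0,
    "rx_length_errors": 1,
    "rx_over_errors": 0,
    "rx_crc_errors": 3,
    "rx_frame_errors": 0,
}

columns = proc_net_dev_rx_columns(driver_counters)
print(columns)
```

So a single "drop" number on a graph can already be hiding two different driver counters.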
![Page 101: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/101.jpg)
but what if…
the drivers don’t agree with each other on what the individual statistics represent?
![Page 102: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/102.jpg)
in other words..
what if: driver_stats->rx_missed_errors
means something different for each driver you ask?
![Page 103: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/103.jpg)
greetings
![Page 104: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/104.jpg)
the meanings of driver stats are not standardized
![Page 105: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/105.jpg)
BTW
![Page 106: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/106.jpg)
stat meanings for a driver/device can change over time.
![Page 107: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/107.jpg)
so:
• you need to figure out which NICs are in prod for all boxes
• which firmware versions used on each NIC
• which versions of drivers used for each NIC
• read all the driver sources for the fields you care about
• read the data sheet to figure out what the fields mean
• build An Collectd plugin (or w/e) to encapsulate this knowledge
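The first three steps can be sketched roughly like this (not from the talk). It assumes you have already collected `ethtool -i <iface>` output from each box somehow; the `driver`, `version`, and `firmware-version` keys are the ones that command prints. The host names and values below are made up, as are the helper names.

```python
from collections import defaultdict


def parse_ethtool_info(text):
    """Parse `ethtool -i` style output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first ':' only; values like bus-info contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def group_hosts_by_nic(outputs_by_host):
    """Group hosts by (driver, driver version, firmware version)."""
    groups = defaultdict(list)
    for host, text in sorted(outputs_by_host.items()):
        info = parse_ethtool_info(text)
        key = (info.get("driver"), info.get("version"),
               info.get("firmware-version"))
        groups[key].append(host)
    return dict(groups)


# Made-up outputs for three hypothetical hosts.
OUTPUTS = {
    "web1": "driver: ixgbe\nversion: 1.0\nfirmware-version: 0xaa\nbus-info: 0000:03:00.0\n",
    "web2": "driver: ixgbe\nversion: 1.0\nfirmware-version: 0xaa\nbus-info: 0000:04:00.0\n",
    "db1": "driver: igb\nversion: 2.0\nfirmware-version: 0xbb\nbus-info: 0000:01:00.0\n",
}

inventory = group_hosts_by_nic(OUTPUTS)
for key, hosts in sorted(inventory.items()):
    print(key, hosts)
```

Each distinct key is one more driver source (and data sheet) you get to read.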
![Page 108: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/108.jpg)
maybe you don’t care
too low level
you care about protocol level stats
![Page 109: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/109.jpg)
odd b/c ethtool settings can eliminate protocol
stack problems
you dont care?
![Page 110: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/110.jpg)
but, w/e
![Page 111: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/111.jpg)
let’s just read protocol stats from /proc/net/snmp!
![Page 112: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/112.jpg)
/proc/net/snmp
• there’s an RFC !!!!!!!! (rfc 2013)
• the fields are standardized!!!!
• it’s higher level, so i can figure out where the protocol layers are breaking down!!!
• they are gathered mostly in software
• much easier than reading a 1200 pg data sheet
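For reference, a rough sketch of what "just read /proc/net/snmp" looks like. The file comes as pairs of lines per protocol: a header line with the field names, then a line with the values. The function names here are hypothetical, and the code only reads the counters; it says nothing about whether the kernel incremented them where you expect.

```python
def parse_proc_net_snmp(text):
    """Parse the contents of /proc/net/snmp into {protocol: {field: value}}."""
    stats = {}
    lines = text.splitlines()
    # Lines come in pairs: "Udp: InDatagrams ..." then "Udp: 123 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, _, names = header.partition(":")
        value_proto, _, numbers = values.partition(":")
        if proto != value_proto:
            # Unexpected layout; skip this pair rather than guess.
            continue
        stats[proto] = dict(
            zip(names.split(), (int(n) for n in numbers.split()))
        )
    return stats


def read_proc_net_snmp(path="/proc/net/snmp"):
    """Read and parse the counters from the running kernel (Linux only)."""
    with open(path) as f:
        return parse_proc_net_snmp(f.read())
```

On a Linux box, `read_proc_net_snmp()["Udp"]` gives you the UDP counters as a plain dict, ready to ship to whatever graphs them.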
![Page 113: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/113.jpg)
what if i told you…
![Page 114: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/114.jpg)
BUGS
• several cases where counters are incremented in the wrong place
• several cases where counters double count
![Page 115: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/115.jpg)
BUGS
several cases where counters aren’t incremented where you might think they should be
![Page 116: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/116.jpg)
BUGS
from linux 3.13.0 net/ipv4/udp.c:

    /*
     * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.
     * Reporting ENOBUFS might not be good
     * (it's not tunable per se), but otherwise
     * we don't have a good statistic (IpOutDiscards but it can be too many
     * things). We could add another new stat but at least for now that
     * seems like overkill.
     */
![Page 117: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/117.jpg)
If this is an important statistic for you, your monitoring might be wrong
![Page 118: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/118.jpg)
so, what does this mean?
![Page 119: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/119.jpg)
so, what does this mean?
• monitoring something requires very deep understanding
• otherwise your graphs, alerts, etc might not actually be measuring what you think they are measuring
![Page 120: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/120.jpg)
so, what does it mean?
• this is why people build entire businesses around monitoring networks (or other stuff).
• resist the urge to think you can solve every problem with a “quick” bash script
![Page 121: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/121.jpg)
so, what does it mean?
• properly monitoring, setting alerts, etc requires significant investment
• i.e. not a bash script over the weekend
• again, not necessarily bad, just important to think about
![Page 122: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/122.jpg)
so, what does it mean?
• nothing is free
• this doesn’t mean that the software is bad
• so plz don't jump to that conclusion
• this is just reality
![Page 123: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/123.jpg)
and now an aside
![Page 124: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/124.jpg)
time, money, and business
• engineers always think they can solve everything by writing enough software
• the problem is that: sometimes spending your time doing that makes no business sense.
• other times doing that is actively detrimental
![Page 125: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/125.jpg)
time, money, and business
• how long would it take you to:
  • figure out if all your networking metrics are right
  • figure out what they all mean
  • set alerts that are sensible
• remember: you need to read a lot of code, data sheets, and potentially several versions of different drivers.
![Page 126: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/126.jpg)
time, money, and business
https://baremetrics.com/calculator
(add at least 35% overhead to salaries)
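A back-of-the-envelope version of that calculation, as a sketch. Only the 35% overhead comes from the slide; the function name, the 52-week year, and the example salary and duration are assumptions made up for illustration.

```python
def loaded_cost(annual_salary, weeks, overhead=0.35, weeks_per_year=52):
    """Estimate what `weeks` of one engineer's time costs the business.

    overhead=0.35 is the "at least 35%" from the slide; everything else
    is an illustrative assumption.
    """
    loaded_salary = annual_salary * (1 + overhead)
    return loaded_salary / weeks_per_year * weeks


# Hypothetical example: one engineer at 120,000/year spending 4 weeks
# reading driver sources and data sheets.
cost = loaded_cost(120_000, weeks=4)
print(f"{cost:,.0f}")
```

Multiply by the number of engineers involved and compare against what wrong network stats actually cost you.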
![Page 127: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/127.jpg)
time, money, and business
and this is why monitoring all the things makes no business sense
for most businesses below a certain revenue level
![Page 128: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/128.jpg)
time, money, and business
and this is also why it doesn’t really matter if these stats are wrong
if these stats actually matter to your business, your business will invest the $$$$ to figure this out
![Page 129: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/129.jpg)
in conclusion:
• complexity is not necessarily bad
• even simple software is buggy and hard to monitor correctly
• it all comes down to value and time
• your network monitoring is probably wrong
• but it probably doesn't matter because if it did, your company would invest $$$$ in figuring it out
![Page 130: All of Your Network Monitoring is (probably) Wrong](https://reader034.fdocuments.net/reader034/viewer/2022052606/58889a081a28ab264b8b4ac9/html5/thumbnails/130.jpg)
?
packagecloud.io @packagecloudio