Ferrara, Thursday, April 13, 2023
PhD Program in Mathematics and Computer Science, University of Ferrara
2nd Year PhD Activities Report (2009)
Alfredo Pagano
Advisor: Prof. Eleonora Luppi
Coadvisor: Dr. Mario Reale
Present employment
Sysadmin in charge of installing, supporting, and maintaining servers and core services at GARR
The aim of Consortium GARR is to plan, manage, and operate the Italian National Research and Education Network, implementing the most advanced technical solutions and services.
Networking Support Activity (EGEE-III SA2)
Enabling Grids for E-sciencE (EGEE) is Europe's leading grid computing project, providing a computing support infrastructure for over 10,000 researchers world-wide, from fields as diverse as high energy physics, earth and life sciences.
GARR-G: Backbone Physical Infrastructure
GARR-G: PoP-level topology
- About 45 GARR PoPs
  - 90% on the premises of universities and research institutions
  - the others on the premises of telco operators and Internet eXchanges
- About 60 backbone links, mainly leased lines from 8 telco operators
- Core links: 10 Gbps (STM-64) and 2.5 Gbps (STM-16)
- Edge links: 34 Mbps, 155 Mbps and 622 Mbps
- Peering links: 10 Gbps (STM-64 and 10 Gigabit Ethernet), 2.5 Gbps (STM-16) and 1 Gbps (1 Gigabit Ethernet)
- Backbone capacity: 120 Gbps
- Peering capacity: 40 Gbps
GARR-G: Backbone Logical Infrastructure
Basically, each IP backbone link corresponds to a leased line; the lines come from different operators.
GARR-G is a multivendor IP network:
- Juniper M320, M20 and M10i
- Cisco 12000, 7500 and 7200
[Figure: GARR-G backbone map. It shows the PoP routers (RT.* and RC.* nodes, e.g. RT1.MI1, RT.MI2, RT1.BO1, RT.RM1, RT.RM2, RT.NA1, RT.CT1), the capacity of each link (from 155 Mbps to 10 Gbps), the router models in use (Juniper M320, M20, M10i and M7i; Cisco 7500 and 7200), the carriers (Telecom, Infracom, Interoute, Fastweb, Wind) and the peerings (GEANT, TELIA, MIX, NaMeX, GlobalX, Level 3).]
- PhD subject
- Motivation, Mission and Scope
- The idea: pros and cons
- Metrics collected
- First results
- Next steps
PhD subject
1st year activity - The research focused on exploiting and testing network monitoring tools to gather relevant metrics on end-to-end performance in Grid infrastructures.
2nd year activity - Design and build a first working version of a Grid network monitoring tool based on Grid jobs.
Motivation (1/2)
- Debugging networks for efficiency: an essential step for those wishing to run data-intensive applications
- Optimizing the performance of Grid middleware and applications so that they use the network intelligently and adapt to changing network conditions
- Supporting the Grid “utility computing” model: ensuring, via measurable SLAs, that the differing network performance required by particular Grid applications is provided
Motivation (2/2)
- A help for Site and Grid operations
  - Help diagnose performance problems among sites
    - This transfer is slow, what’s broken? The network, the server, the middleware…
    - I can’t see site X: has the network gone down, or is it just a particular service or machine?
    - My application’s performance varies with time of day: is there a network bottleneck?
  - Help diagnose problems within sites
    - Most network problems, especially performance issues, are not backbone related; they are in the “last mile”
  - Help with planning and provisioning decisions
  - Is an SLA I’ve arranged being adhered to by my providers?
- For Grid services and middleware
  - I want to increase the performance of file transfers between sites
  - I want to know which compute site is “closest” to my data, so that I can submit a job to it
The idea: pros and cons
Instead of installing a probe at each site, run a Grid job!
Added value:
- No installation needed at the sites
- The monitoring system runs on a proven system (the grid) and can use grid services
- Direct use of grid AuthN and AuthZ
Limits:
- The job does not run with root privileges on the Worker Node (WN)
  - Some low-level operations are not permitted
- Heterogeneity of the WN environments (OS, 32/64 bits, etc.)
  - Ex: making the job download and run an external binary may be tricky (unless it is written in an OS-independent programming language)
- The system has to deal with the grid mechanism overhead (delays…)
The system in action
[Figure: the UI and the Central Monitoring Server Program (CMSP) run at site paris-urec-ipv6. Jobs are submitted through the WMS (site X) to the CE of each monitored site (A, B, C) and run on a WN there. Each job opens a socket connection to the CMSP and reports "Ready!"; the CMSP then sends probe requests over that connection, e.g. "RTT test to site A" or "BW test to site B".]
Some remarks
- The chosen design is more efficient than starting a job for each probe (considering the delays)
- The TCP connection is initiated by the job
  - No open port is needed on the WN -> better for site security
- An authentication mechanism is implemented between the job and the server
- High scalability (backend and frontend can be easily decoupled)
- A job cannot run forever (GlueCEPolicyMaxWallClockTime), so there are two jobs running at each site:
  - a ‘main’ one
  - a ‘redundant’ one, which waits and becomes ‘main’ when the other one ends
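The job side of this design could look roughly like the sketch below. It assumes a hypothetical line-based text protocol (a "READY <token>" greeting, then newline-terminated requests, then "QUIT"); the real CMSP protocol and its authentication mechanism are not described in these slides, and `run_probe` is only a placeholder.

```python
import socket


def run_probe(request: str) -> str:
    """Placeholder probe dispatcher (hypothetical request format: '<KIND> <target>')."""
    kind, _, target = request.partition(" ")
    return f"RESULT {kind} {target} not-implemented"


def job_loop(server_host: str, server_port: int, token: str) -> int:
    """Connect out to the central server and answer probe requests.

    The job initiates the TCP connection, so the worker node needs no open port.
    Returns the number of requests handled before the server ends the session.
    """
    handled = 0
    with socket.create_connection((server_host, server_port), timeout=30) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        # Hypothetical greeting: announce readiness and present a shared token.
        stream.write(f"READY {token}\n")
        stream.flush()
        for line in stream:
            request = line.strip()
            if not request or request == "QUIT":
                break
            stream.write(run_probe(request) + "\n")
            stream.flush()
            handled += 1
    return handled
```

Keeping one long-lived connection per job means the grid submission delay is paid once per job rather than once per probe, which is the efficiency argument made above.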
Round-Trip time, MTU and hop count tests
[Figure: over its socket connection, the CMSP at site paris-urec-ipv6 sends the job on the WN of site B a probe request ("RTT test to site C"); the job probes the CE of site C and returns the probe result to the CMSP.]
Round-Trip time, MTU and hop count tests
- The ‘RTT’ measure is the time a TCP connect() function call takes, because a connect() call involves a round trip of packets:
    SYN ->       (round trip)
    <- SYN-ACK   (round trip)
    ACK ->       (just sending => no network delay)
  - Results are similar to the ones of ‘ping’
- The MTU is given by the IP_MTU socket option
- The number of hops is calculated in an iterative way
- All these measures require:
  - connecting to an accessible port (1) on a machine of the remote site
  - closing the connection (no data is sent)
  - Note: this (connect/disconnect) is detected in the application log
(1): We use the port of the gatekeeper of the CE, since it is known to be accessible (it is used by gLite)
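The three measurements can be sketched without root privileges as follows. This is an illustration, not the tool's actual code: the function names are made up, the IP_MTU fallback value 14 is the Linux constant, and the hop count loop (raise the socket TTL until connect() succeeds) is only one plausible reading of "calculated in an iterative way".

```python
import socket
import time

# Linux value of IP_MTU from <linux/in.h>; an assumption on other systems.
IP_MTU = getattr(socket, "IP_MTU", 14)


def tcp_rtt(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the time in seconds that a TCP connect() takes (about one round trip)."""
    addr = socket.gethostbyname(host)  # resolve first so DNS time is not measured
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.connect((addr, port))  # returns once the SYN-ACK has arrived
        return time.perf_counter() - start


def path_mtu(host: str, port: int, timeout: float = 5.0):
    """Return the kernel's known path MTU for a connected socket, or None if unsupported."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect((host, port))
        try:
            return sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
        except OSError:
            return None


def hop_count(host: str, port: int, max_hops: int = 30, timeout: float = 2.0):
    """Return the smallest TTL for which connect() succeeds, or None if none does."""
    addr = socket.gethostbyname(host)
    for ttl in range(1, max_hops + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            try:
                sock.connect((addr, port))
                return ttl
            except OSError:
                continue  # SYN expired on the way (or timed out): try a larger TTL
    return None
```

Each call opens and closes a connection without sending data, so the remote service only sees a connect/disconnect pair, as noted above.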
WN-to-WN BW test (may be obsoleted)
[Figure: the CMSP at site paris-urec-ipv6 asks the job at site C to open a TCP port <p>, then asks the job at site A to run a BW test to wn-site-C:<p>. The job at site A connects and sends a big amount of data, and the probe result is returned to the CMSP.]
GridFTP BW test
[Figure: the CMSP at site paris-urec-ipv6 sends the job at site A a probe request ("GridFTP BW test to site C"). The job triggers the replication of a big grid file between two SEs, reads the GridFTP log file, and returns the probe result to the CMSP.]
WN-to-WN BW test (under discussion)
- It requires the remote site to allow incoming TCP connections to the WN
  - Not a best-practice security policy
  - Not always possible (WNs behind a NAT)
  - Workarounds are sometimes possible
- WN-to-WN transfers do not reflect a real use case
  - The WN network connectivity may not be suited to such transfers
Metrics collected & scheduling
Metric                 Method                                                            Frequency
Latency                Ping                                                              Every 5 minutes
Hop list               Traceroute                                                        Every 5 minutes
MTU size               Socket (IP_MTU socket option)                                     Every 5 minutes
Achievable bandwidth   TCP throughput of a GridFTP transfer between 2 Storage Elements   Every 8 hours
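As an illustration of this schedule, the sketch below encodes the four periods and selects the probes that are due at a given time. The metric keys and function names are hypothetical; only the periods come from the slide.

```python
# Probe periods in seconds, taken from the schedule above.
PERIODS_S = {
    "latency": 5 * 60,
    "hop_list": 5 * 60,
    "mtu": 5 * 60,
    "gridftp_bw": 8 * 3600,
}


def due_probes(last_run: dict, now: float) -> list:
    """Return (sorted) the metrics whose period has elapsed since their last run.

    `last_run` maps a metric name to the time of its last run, in seconds;
    a metric that has never run is always due.
    """
    return sorted(
        metric
        for metric, period in PERIODS_S.items()
        if now - last_run.get(metric, float("-inf")) >= period
    )


def runs_per_day(metric: str) -> int:
    """Number of scheduled runs of `metric` per site pair in 24 hours."""
    return 86400 // PERIODS_S[metric]
```

With these periods, each lightweight test runs 288 times a day per site pair, while the much heavier GridFTP transfer runs only 3 times.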
8 Sites involved
A. Paris Urec CNRS
B. IN2P3 Lyon
C. INFN-CNAF
D. INFN-ROMA1
E. INFN-ROMA-CMS
F. GRISU-ENEA-GRID
G. INFN-BARI
H. INFN-CATANIA
Traceroute Paris-Catania
[pagano@ui-ipv6-testbed ~]$ traceroute grid005.ct.infn.it
traceroute to grid005.ct.infn.it (193.206.208.18), 30 hops max, 38 byte packets
 1 194.57.137.190 (194.57.137.190) 1.589 ms 1.479 ms 2.696 ms
 2 r-interco-urec.reseau.jussieu.fr (134.157.247.38) 0.331 ms 0.273 ms 0.348 ms
 3 r-jusrap-reel.reseau.jussieu.fr (134.157.254.124) 0.368 ms 0.320 ms 0.318 ms
 4 interco-6.01-jussieu.rap.prd.fr (195.221.127.181) 0.258 ms 0.330 ms 0.255 ms
 5 * * *
 6 te1-2-paris1-rtr-021.noc.renater.fr (193.51.189.230) 1.224 ms 1.127 ms 1.121 ms
   MPLS Label=489 CoS=6 TTL=1 S=0
 7 te0-0-0-3-paris1-rtr-001.noc.renater.fr (193.51.189.37) 1.182 ms 1.298 ms 1.150 ms
 8 renater.rt1.par.fr.geant2.net (62.40.124.69) 1.154 ms 1.153 ms 1.128 ms
 9 so-7-3-0.rt1.gen.ch.geant2.net (62.40.112.29) 9.950 ms 9.958 ms 10.018 ms
10 so-3-3-0.rt1.mil.it.geant2.net (62.40.112.210) 17.264 ms 17.314 ms 17.372 ms
11 garr-gw.rt1.mil.it.geant2.net (62.40.124.130) 17.369 ms 21.514 ms 17.368 ms
12 rt1-mi1-rt-mi2.mi2.garr.net (193.206.134.190) 17.555 ms 17.533 ms 17.646 ms
13 rt-mi2-rt-rm2.rm2.garr.net (193.206.134.230) 26.996 ms 27.071 ms 27.183 ms
14 rt-rm2-rt-rm1-l1.rm1.garr.net (193.206.134.117) 27.050 ms 27.160 ms 27.062 ms
15 rt-rm1-rt-ct1.ct1.garr.net (193.206.134.6) 44.854 ms 44.882 ms 44.820 ms
16 rt-ct1-ru-infngrid.ct1.garr.net (193.206.137.186) 45.115 ms 45.013 ms 45.009 ms
17 grid005.ct.infn.it (193.206.208.18) 45.014 ms 44.967 ms 44.913 ms
Frontend view:
LDAP authentication; based on the Google Web Toolkit (GWT) framework
Next steps:
1. Triggering system to alert site and network admins
2. Frontend improvements (plotting graphs)
3. Not only scheduled, but also on-demand measurements
…
Thank you for your attention!
Backup
GridFTP BW test
- This test shows good results
- If the GridFTP log file is not accessible (cf. dCache?):
  - We just do the transfer via globus-url-copy and measure the time it takes
  - This is slightly less precise
- How many streams should we request on the command line?
  globus-url-copy -p <num_streams> […]
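The fallback (time the whole transfer when the log is not readable) could be sketched as below. The helper names and the example URLs are hypothetical; the only globus-url-copy option used is -p, as on the slide. The measured wall-clock time also includes process start-up and session set-up, which is presumably why this approach is slightly less precise than reading the GridFTP log.

```python
import subprocess
import time


def build_copy_command(src_url: str, dst_url: str, streams: int) -> list:
    """Build a globus-url-copy command line requesting `streams` parallel streams."""
    return ["globus-url-copy", "-p", str(streams), src_url, dst_url]


def timed_throughput(command: list, size_bytes: int) -> float:
    """Run `command`, time it, and return the apparent throughput in Mbit/s.

    `size_bytes` is the size of the transferred file, which the caller must know.
    Raises subprocess.CalledProcessError if the command fails.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    return size_bytes * 8 / elapsed / 1e6


# Example usage (hypothetical URLs, 1 GB file, 4 streams):
# cmd = build_copy_command("gsiftp://se.site-a.example/big.dat",
#                          "gsiftp://se.site-c.example/big.dat", 4)
# print(timed_throughput(cmd, 10**9))
```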
Network Performance Factors
- End System Issues
  - Network Interface Card and driver, and their configuration
  - TCP and its configuration
  - Operating System and its configuration
  - Disk system
  - Processor speed
  - Bus speed and capability
  - Application (e.g., old versions of scp)
- Network Infrastructure Issues
  - Obsolete network equipment
  - Configured bandwidth restrictions
  - Topology
  - Security restrictions (e.g., firewalls)
  - Sub-optimal routing
  - Transport protocols
- Network Capacity and the influence of Others!
  - Many, many TCP connections
  - Congestion