UK TOP SECRET COMINT Reference OP TECHB 6 1 Date 13 June 2011 Copy no A potential technique to deanonyrnise users of the TOR network Summary A new technique to deanonymise users of the TOR network is demonstrated It is shown that the majority of a small truthed set of data can be deanonymised without any false hits anywhere in eight bearer hours of data For this algorithm to be further tested we must run some TOR exit nodes and collect data from these SIGINT packet logging of guard node traf c is also required Demonstration software is available in the form of an package from Copy Distribution SEQUOIA 7 ICTR 18 pages UK TOP SECRET COMINT UK TOP SECRET COMINT THIS PAGE IS INTENTIONALLY LEFT BLANK 2 UK TOP SECRET COMINT I UK TOP SECRET COMINT 61 A potential technique to deanonymise users of the TOR network OPC-MCR GCHQ 13 June 2011 1 Introduction The Onion Router TOR is used by individuals and organisations that want to hide the originating IP address of their communications A TOR client chooses the circuit their traf c will follow through a set of TOR routers before contacting the destination server The rst hop after the client is to the subset of the routers designated as guard nodes The nal hop in the TOR network is from a subset of routers which choose to be an exit node There can be any number of TOR routers between the guard and exit nodes but typical clients choose to have one router There are many features that mean it is hard to track traf c through the TOR network 0 Traf c is in multiple layers between the client node and each TOR router leading to the Onion analogy Hence data between each TOR router has different ciphertexts traf c is only seen between the exit node and destination server 0 TOR splits all traf c up into standard size cells Therefore packet sizes can not be used to follow traf c 0 Each connection between TOR routers will typically multiplex many circuits traf c effectively masking any particular user s traf c 0 To ensure fairness of service TOR has a per circuit rate limiting store and forward buffer in each router This buffer tends to flatten timing features in traf c The aim of this work is to nd the client IP address associated with traf c between an exit node and a destination server This paper achieves this aim given the following constraints 0 We must own the exit node This constraint means that we can demultiplex traf c by TOR circuit and thus get a cleaner signal See section 5 for more detail 3 This information is exempt under the Freedom of Information Act 2000 and mav be exem 1 under other UK information legislation Refer any queries to GCI K on UK TOP SECRET COMINT 61 0 we must be able to log the packet times of the traf c from the guard node to the client 0 The TOR communication must be long and structured we will demonstrate the technique against a single user browsing the web via TOR The client must not be running a TOR router of their own otherwise we can not separate the client s own traf c from other TOR traf c The attack we will present is based on correlating exit node and guard node traf c and does not require tracking communications through any intermediate link in the TOR network This approach should help maximise the chance of a successful attack despite incomplete SIGINT coverage 2 Test data we are going to work with two sets of data 1 Truthed data where we have matched guard node and exit node traf c for a single user web browsing 2 Bulk traf c logs for guard nodes from four SIGIN bearers we will try to develop techniques that can correctly match up the truth data but not false alarm anywhere in the bulk SIGIN logs 2 1 Truthed data TOR is designed to make it hard to link client and exit node traf c together In conjunc tion with and JTRIG we came up with a way to collect the exit node traf c from our own web browsing As illustrated in gure 1 we used a virtual private server VPS as an intermediate destination for our traf c We then run packet loggers on our client machine and the VPS we run an open HTTP proxy server on the VPS we want to run an open proxy server to ensure that there is minimal impact on the data flows which for example authenti cation may introduce whilst still being able to conduct representative web browsing A risk of an open proxy server is that other internet users may use it thus potentially lead ing to wful traf c interception To avoid this danger we changed our User Agent string to a non standard value and the proxy server was con gured to only respond to this user agent Furthermore the proxy server was only active for the brief duration of the experiment This setup was approved by JTRIG No traf c due to other users was detected in the packet logs 4 This informaiion is exempL under Lhe Freedom of Informaiion ACL 2000 FOIA and mav be exem 3L under oLher UK informaiion legislaiion Refer any FOIA queries Lo GCHQ on UK TOP SECRET COMINT 61 Client running a TOR button software TOR network user-agent changed VPS with an proxy set open web-proxy verifying user-agent Figure 1 Our test data collection infrastructure we control the two shaded hosts and run packet logging on these hosts we connect to the public TOR network guard node and exit node are shown using default settings of the TOR button we then browse public web sites On our client we used the standard button web browser extension to access TOR with two minor modi cations we changed the internal TOR button proxy polipo to use our open HTTP proxy after transiting TOR Secondly we changed the web browser User Agent to match that accepted by our open HTTP proxy Four experiments of about ten minutes each were conducted 1 News Search on Bing for news followed by browsing BBC News and then the W ashington Post 2 browsing the TOR website and then using a privacy checking website 3 Download visit to SlashDot followed by downloading a large PDF le 4 Forum Search on Google followed by browsing a PC technical help forum Packet logs were collected for each of these In each experiment one guard node and one exit node was used apart from in experiment 2 where TOR set up a new path during the experiment with a change in both guard node and exit node Therefore we will generally split experiment 2 into the two circuits 2a and 2 2 Bulk data To prove a low false alarm rate of any algorithm we need bulk data we used IP address dumps from four internet bearers two were client server bearers and two were server client bearers The data was kindly collected by GTE with their high speed 5 This information is exempt under the Freedom of Information Act 2000 FOIA and may be exempt under other UK information legislation Refer any FOIA queries to GCI 1Q on UK TOP SECRET COMINT UK TOP SECRET COMINT 61 data recorder The timestamps are microsecond accurate but the data comes with no packet sizae protocol or port information Each capture was two hours long totalling eight hours of collection we identify TOR traf c by use of consensus logs which are imported by NE The TOR network decides what nodes are part of the network and what their roles are we lter the SIGINT packet traces to packets that satisfy 0 One IP address being a guard node as identi ed in the consensus log closest to the data collection period and The other IP address not being seen any the consensus log from a month surround ing that period This approach will inaccurately identify traf c between clients and guard nodes the inaccuracy comes from the fact that the guard nodes may be doing other non TOR communication we therefore believe that we will approximately lter down to a superset of the wanted traf c we note that these ltering rules mean that we will ignore any clients that also run a TOR node However this constraint will still apply for a potential operational scenario where we also run guard nodes without full knowledge of the client one can not tell whether a circuit is terminating in the client or being relayed through its TOR node 3 Why we are not tracking a circuit through the TOR network When we started this work we considered attacks that tried to follow data through the TOR network However an initial experiment based on JTRIG browsing logs showed that SIGINT visibility was too low for a signi cant chance of success we will now describe the experiment conducted but the rest of the section is not required reading we compare two sources of data 0 Communication between TOR servers seen in SIGIN T TDSD deployed an ICTR signature in Squeal to detect communication between TOR nodes across the SIGIN fleet for two days in December 2010 To remove signature false hits we lter down to hits where both IP addresses are known to be TOR routers by consensus logs from the same period 0 TOR links used by JTRIG for a similar period JTRIG log the usage of TOR connections Each of their circuits typically have two links between TOR routers 6 This information is exempt under the Freedom of Information Act 2000 FOIA and mav he exem 3t under other UK information legislation Refer any FOIA queries to GCI 1Q on UK TOP SECRET COMINT 61 There were 1958 TOR routers in the relevant consensus les of which 1893 had signature hits we consider links between TOR routers as undirected and an all pairs calculations allows us to estimate that there are 1 79 to 1 92 million possible links in the TOR network we do not know whether we do not see a router due to it being off or invisible to us However we only see 6185 links between TOR routers in the SIGIN logs Therefore we are seeing about 0 3% of possible TOR links in SIGIN T Note this percentage is likely to be an small underestimate of the visibility of TOR links as most TOR circuits will consist of two links within the TOR network one link involving a guard node and one link involving an exit node Based on the consensus logs there are 1 65 million links that satisfy this constraint this link count raises visibility to The JTRIG logs use 8294 links between TOR nodes of which we see 13 Therefore we could only sec 0 16% of JTRIG used TOR links in SIGIN T The JTRIG link visibility percentage and the population link visibility percentage suggest a signi cantly different underlying distribution p value between 0 005 and 0 01 TOR is claimed to use a server s geolocation to choose circuits perhaps this circuit choosing algorithm hinders our visibility If we assume the visibility of each link in a 2 path within the TOR network is inde pendent then the chance of seeing any particular 2 path is between 1 in 100 000 and 1 in 400 000 In the JTRIG logs of 2935 paths we would expect to observe both links of one or more paths with probability 0 7% to we saw one so either we were lucky or link visibility is correlated in a path To complete a circuit trace you would also need to see the client to guard node traf c and the exit node to server traf c which would lower the probability of visibility yet further We believe such probabilities are too low for a useful attack and thus we do not consider circuit tracing attacks further 4 Correlating guard and exit node traf c we will use the correlation in the timing of data packets between guard and exit nodes to deanonymise TOR There are two features that we need to remember about TOR TOR repacketises the data which means that packets and packet sizes can not be exactly followed through the network 0 Each TOR node has a rate limiting store and forward methodology which will af fect packet timings These two factors means that a burst of packets entering the TOR network will be smeared out over time into cells that do not directly map to the input packets and comparing histograms of guard and exit traf c is hard 7 This information is exempt under the Freedom of Information Act 2000 FOIA and mav be exem 3t under other UK information legislation Refer any FOIA queries to GCHQ on UK TOP SECRET COMINT Our main insight is to consider cumulative distributions rather than histograms In particular we will consider cumulative packet count and cumulative TCP payload bytes we hope that over a long time window that there will be approximate conservation of these quantities and the impact of rate limiting will be less signi cant rate limiting will just impose a maximum gradient In gure 2 and gure 3 we show cumulative packet counts and cumulative TCP payload sizes for our truth data It can be seen that a lot of features are preserved it seems sensible to consider that comparison of these cumulative plots between guard and exit node traf c could lead to a deanonymisation technique Considering gure 2 in more detail one can see steps in these plots replicated between guard node and exit node traf c for all the plots apart from perhaps CtoS 4 perhaps client to server traf c is harder to deanonymise You can also see that there are changes in scale for example in experiment 3 there are more guard node packets than exit node packets Figure 3 of cumulative TCP payloads shows some similar features All the server to client graphs show preserved steps However the client to server graphs do not show very clear relationships In particular looking at CtoS you can see a big difference during the download of the large PDF le The exit node is sending TCP ACKs with no payload back to the server but the client is sending larger TOR acknowledgement cells back which leads to a large difference in graph shape Therefore we will consider deanonymisation based on cumulative packet counts As well as looking like a generally cleaner picture than using TCP payload such an approach requires less data to be collected by the SIGINT system This easier data collection is exempli ed by our bulk data logs these logs do not contain packet sizes let alone TCP payload sizes 5 Why we want to own the exit node SIGIN gives us two added complications that the above truth data analysis above has ignored 1 Each exit node is serving many clients which will add unwanted data into the traf c for a single user TOR users are encouraged to run privacy preserving proxies g privoxy or polipo and these proxies try to normalise the traf c so that ev eryone shares the same HTTP header format The impact of this is that SIGINT proxy demultiplexing techniques can not reliably be used It is potentially possible to try and demultiplex traf c via content analysis looking at what links are included in webpages but such a technique would be complex and fragile to HTML formatting errors and require all web browsing to be 8 This information is exempt under the Freedom of Information Act 2000 FOIA and mav be exemJt under other UK information legislation Refer any FOIA queries to GCHQ on UK TOP SECRET COMINT 61 Cumulative packet count CtoS 1 StoC 1 010002500 I I I I I 010002500 I I I I 15 46 15 48 15 50 15 52 15 54 15 56 15 46 15 48 15 50 15 52 15 54 15 56 CtoS 28 StoC 16 04 16 06 16 02 16 04 16 06 2b StoC 2b 16 02 16 04 16 06 16 08 16 02 16 04 16 06 16 08 CtoS 3 StoC 3 Cumulative packet count 0 2000 5000 0 100 300 I I 0 4000 0 200400 16 12 16 14 16 16 16 18 16 12 16 14 16 16 16 18 CtoS 4 StoC 4 0_ Ln 16 22 16 24 16 26 16 28 16 22 16 24 16 26 16 28 Time Figure 2 Cumulative packet count for the four triithed sessions On the left hand side the client to server traf c is shown On the right hand side the server to client traf c is shown The second truthed session is split into the two signi cant circuits used 9 This information is exempt under the Freedom oi Information 2000 and max be e em it under other UK information legislation Refer an FOIA queries to GCI K on UK TOP SECRET COMINT UK TOP SECRET COMINT 61 Cumulative payload Exit Guard CtoS 1 StoC 1 C a 8 8 C 0 15 46 15 48 15 50 15 52 15 54 15 56 15 46 15 48 15 50 15 52 15 54 15 56 8 CtoS 2a StoC 16 02 16 04 16 06 16 02 16 04 16 06 2b StoC 2b 16 2 16 4 1 06 1 08 1 02 16 4 1 06 1 08 CtoS 3 StoC 3 Cumulative TCP payload bytes 020000 0e 003 05 I 1 12 1614 1 16 1618 1 12 1 14 1616 1 18 CtoS 4 StoC 4 0e 003e 05 II I II 06000000 1622 1 24 1626 1628 1622 1624 1626 1628 0 100000 I I I 0e 00 8e 05 I II II Time Figure 3 Cumulative payload bytes for the four truthed sessions On the left hand side the client to server traf c is shown On the right hand side the server to client traf c is shown The second truthed session is split into the two signi cant circuits used 10 This information is exempt under the Freedom of Information information legislation Refer an FOIA queries to GCI K on UK TOP SECRET COMINT UK TOP SECRET COMINT 2 Limited SIGINT coverage will mean we will only see a subset of the exit node traf c The exit node could be contacting servers anywhere on the internet and unless we have collection very close to the exit node we are likely to only see a subset of these connections A potential approach around these problems is to demultiplex by HTTP server We hope that in a time window there might only be one user browsing a particular website so demultiplexing by server will therefore demultiplex by user too We also hope for complete collection of some of these streams as packets will hopefully follow a small number of routes between the exit node and the HTTP server We show the in uence of demultiplexing by HTTP server in gure 4 It can be seen that the clear picture in the previous section is no longer present a modern webpage is typically made up of content from many different HTTP servers By it can be seen that it is hard to link many of these fragments of traf c per HTTP server to the guard node traf c Indeed using the mathematical technique we will describe in the following section only four of the server to client HTTP traces could be successfully linked to guard node traf c We therefore recommend that we analyse traf c from exit nodes that we control Such an approach has several desirable features 0 We can demultiplex exit node traf c by client as the TOR exit node can associate all traf c with a circuit Complex and potentially inaccurate demultiplexing algorithms are not required 0 We collect all exit node traf c for each client circuit opposed to SIGIN which may only give fragments of the traf c We therefore have the cleanest possible data 0 We hope to deanonymise all traf c for a circuit rather than just fragments of the traf c which may not contain the data of interest to an analyst A resultant database of TOR traf c would give a simple place to experiment and deploy TOR deanonymisation analytics The NSA have already demonstrated the demultiplexing of traf c in a TOR exit node 6 Score We compare an exit trace and a guard node trace using basic linear regression building on a suggestion of We bin time up into 1 second bins 1 11 This information is exempt under the Freedom of Information Act 2000 FOIA and may be exempt under other UK information legislation Refer any FOIA queries to GCI K on UK TOP SECRET COMINT UK TOP SECRET COMINT Guard 61 Cumulative packet count per host Guard secondary Exit a fsdn c0m Exit analyze privacy net EXIt forums techguy 0rg Exit media3 washingtonpost com Exit media washingt0npost com Exit news bbcimg co uk CtoS 1 8 StoC 1547 - lizlgi i 15 46 15 48 15 50 15 52 15 54 15 56 15 46 15 48 15 50 15 52 15 54 15 56 CtoS 2 StoC 16 02 16 04 16 06 16 08 16 02 16 04 16 06 16 08 a 2 8 Ct05 StoC 416 12 16 14 16 16 16 18 16 12 16 14 16 16 16 18 8 CtoS 4 StoC 4 LO c 00 8 - 16 22 16 24 16 26 16 28 16 22 16 24 16 26 16 28 Time Figure 4 Cumulative packet count for the four truthed sessions but with exit traf c split across the main hosts contacted On the left hand side the client to server traf c is shown On the right hand side the server to client traf c is shown 12 UK TOP SECRET COMINT UK TOP SECRET COMINT 61 this choice is arbitrary but does match the timing precision reported by most SIGIN systems In each bin we then have a pair Eh where E2 is the cumulative exit node packet count at time 7i and GE is the cumulative guard node packet count at time Our primary measure of association is the correlation between E2 and GE as measured by the Pearson product moment correlation coef cient r We expect that when the two traf c streams are associated that a very high correlation coef cient will result we note that high correlation coef cients are generally expected as we re comparing two increasing sequences To remove false positives we also t the following basic linear model 2 we then use 3 to discard bad matches We expect that a similar number of packets are expected in both traces we only consider traces where 1 2 3 2 A key consideration is how to handle guard node and exit node traces that do not precisely overlap We are assuming that our exit node trace is the complete log of a TOR circuit On the other hand the guard node trace may contain traf c for other TOR circuits We therefore truncate the guard node trace to the period of the exit node trace we also pad the guard node packet counts with zero if required to span the exit node trace In addition we consider score all possible time shifts of the guard node trace against the exit node trace i e trying out different timing offsets we do this as we do not know the clock offset in our truth data set and to be as generous as possible in nding false positives in the bulk logs It is not known whether such a sliding approach may prove useful in practice with known clock offsets it is possible that such an approach may also combat unknown delays in the TOR network 7 Results we rst show results of the proposed technique on our truth data set and then consider false positive rates As mentioned above we will slide the putative guard node and exit nodes traces against each other to allow for an unknown clock offset between the client and HTTP proxy in the truth data set and to give us the maximum number of comparisons when trying to the false positive rate 7 1 True positives in the truth data The rst experiments are to understand the behaviour of the algorithm on the truth data In gure 5 we show plots of r2 and 3 as the time offset is varied It can be seen 13 This informaiion is exempL under Lhe Freedom of Informaiion A 7 informaiion legislaiion Refer any FOIA queries Lo GCHQ on UK TOP SECRET COMINT 61 beta r 2 500 0 500 CtoS 1 StoC 1 2 0 1 5 1 0 0 5 0 0 CtoS 2a StoC 2a 2 0 1 5 1 0 fut iiLit 0 0 CtoS 2b StoC 0 0 CtoS 3 StoC 3 2 0 1 5 1 0 0'5 0 0 CtoS 4 StoC 500 500 Time offset I Figure 5 The correlation r2 and the linear regression coefficient 3 as the time offset between the guard node and exit node traffic is varied 14 This information is exempt under the Freedom of Information v ct 2000 POM and may be exempt under other UK information legislation Refer any FOIA queries to GCI K on UK TOP SECRET COMINT UK TOP SECRET COMINT 61 Sample I Client to server Server to client 1 0 9977 0 9991 2a 0 9238 0 9464 2b 0 9957 0 9983 3 0 9952 0 9998 4 0 9975 0 9983 Table 1 The maximum values of r2 achieved on the truth data Server to client Server to client 0 0 0 2 0 4 0 6 0 8 1 0 19 06 19 04 19 02 1e 00 False positive rate False positive rate Figure 6 The ROC curve for the server to client direction false positive rate shown on a linear axis in the left plot and on a logarithmic axis in the right plot that correlations close to 1 are achieved at time offsets close to 0 The short circuit 2a fails to achieve a high correlation Sensible values for 8 also result in particular in the server to client direction The maximum value of the correlation is also shown in table 1 and it can be seen that higher correlations are achieved in the server to client direction 7 2 False positives in the bulk data We now compare our truth exit traces to the bulk data set and see if we nd any false positives As previously mentioned we re sliding time to allow as many traces as possible to be included In the server to client direction we can choose a threshold of 0 998 such that we only miss one true positive whilst having no false positives The one circuit we miss is 2a where the session is short the main activity is only over 33 seconds The ROC curve looks very good gure 6 15 This information is exempt under the Freedom of Information Act 2000 FOIA and may be exempt under other UK information legislation Refer any FOIA queries to GCI K on UK TOP SECRET COMINT UK TOP SECRET COMINT 61 Client to server Client to server 0 0 0 2 0 4 0 6 0 8 1 0 19 06 19 04 19 02 ie 00 False positive rate False positive rate Figure 7 The ROC curve for the client to server direction false positive rate shown on a linear axis in the left plot and on a logarithmic axis in the right plot In the client to server direction the true and false cases are not so well separated see gure 7 we are not able to detect any true positives without some false hits This poor separation is probably due to there being less distinctive structure in the client to server direction HTTP requests share a similar size In the server to client direction a threshold of 0 998 nds the true cases except without any false hits we can try to extrapolate what this false hit rate means to SIGIN collection It is likely that one would restrict the time slide we would imagine a maximum of 5 seconds may be sensible as opposed to the 2 hours in each bearer we have allowed If we imagine restacking our 5circuits 4bearers 120minutes of collection into 288 circuits 100 bearers 10 seconds you could imagine it is possible we could run 288 test circuits against an aperture of 100 bearers without false hitting when allowing ourselves to slide the time window by 5 seconds With our current test data we can t promise a lower false positive rate and further experimentation with sustained collection would be required to better evaluate a false positive rate 8 Software we provide a reference implementation of the algorithm as an package owcompare Three functions are provided format owdata Formats data to that used internally in the algorithm 16 This information is exempt under the Freedom of Information Act 2000 FOIA and mav be exem 3t under other UK information legislation Refer any FOIA queries to GCI 1Q on UK TOP SECRET COMINT 61 score owpair Scores a pair of ows at all time offsets or at a limited range of time offsets ndbestmatching ow Find the guard ow that best matches an exit How We also include our truth data as a data frame called tortruth This package is available 9 Conclusion We have a shown a technique that can deanonymise TOR given packet times between the client and guard node and packet times from the exit node ltered to a single circuit The false positive rate looks low enough to suggest this technique should be carried forward The required data is not collected at present For this technique to work the following additional data feeds will be required 0 Second accurate packet logging at TOR exit nodes we control with packets labelled by a unique circuit identi er 0 Second accurate packet logging of sessions between TOR clients and TOR guard nodes This data could be obtained by SIGIN or by running guard nodes The SIGIN solution would require an up to date feed of TOR consensus documents TOR IP addresses could then be extracted from the consensus documents for ltering by the SIGINT system At the time of writing TRIG are investigating the collection of the exit node data and are trialling a feed of guard node data from research bearers Wider testing is recommended to better characterise the false positive rate A rst test case both because we have truth data and for InfoSec purposes may be to try and deanonymise JTRIG TOR usage References Various TOR resources linked off SA TOR capability reported by Options for further research into the Tor network 7694BA 5001 3 104 2 February 2010 DISCOVER 6722909 17 This information is e empr under the Freedom of Information 3000 FOIAJ and mm be e empr under other lli information legislnricm Refer inv queries Lo GCI on UK TOP SECRET COMINT UK TOP SECRET COMINT THIS PAGE IS INTENTIONALLY LEFT BLANK 18 UK TOP SECRET COMINT I National Security Archive Suite 701 Gelman Library The George Washington University 2130 H Street NW Washington D C 20037 Phone 202 994‐7000 Fax 202 994‐7005 nsarchiv@gwu edu
OCR of the Document
View the Document >>