Status Update
Comments
co...@gmail.com <co...@gmail.com> #2
I don't know much about the Mac device detection code, but my $0.02:
Most device detection issues that I encounter on Windows are caused by bad cables or cable extensions. It makes me contemplate whether there should be a 'adb usbtest' command to stress-test cables.
I'm surprised that there is a pipe stall. I don’t think I've ever encountered this on Windows. In fact, adb Windows doesn’t even contain the API call to clear a stall (technically if one of my patches goes in, it will have the call to reset the pipe, but that will only be used in the case of resuming from sleep/hibernation).
Maybe this is a clue that the Mac device detection code should reset the pipe very early in device detection? I'm not sure. BTW, the first Google hit for 'usb pipe stall' points to a Mac mailing list:http://lists.apple.com/archives/usb/2004/Dec/msg00082.html
I'm pretty suspicious that after the pipe stall is cleared, that the read comes back with data with a bad header ('invalid magic').
I'm not so convinced that the logs suggest a race condition (but I could be wrong).
Chrome has its own embedded implementation of adb that uses libusb. If it's possible, you might want to try it to see if it has the same problem.https://developer.chrome.com/devtools/docs/remote-debugging I couldn't see an easy way to do your repro with it though.
Most device detection issues that I encounter on Windows are caused by bad cables or cable extensions. It makes me contemplate whether there should be a 'adb usbtest' command to stress-test cables.
I'm surprised that there is a pipe stall. I don’t think I've ever encountered this on Windows. In fact, adb Windows doesn’t even contain the API call to clear a stall (technically if one of my patches goes in, it will have the call to reset the pipe, but that will only be used in the case of resuming from sleep/hibernation).
Maybe this is a clue that the Mac device detection code should reset the pipe very early in device detection? I'm not sure. BTW, the first Google hit for 'usb pipe stall' points to a Mac mailing list:
I'm pretty suspicious that after the pipe stall is cleared, that the read comes back with data with a bad header ('invalid magic').
I'm not so convinced that the logs suggest a race condition (but I could be wrong).
Chrome has its own embedded implementation of adb that uses libusb. If it's possible, you might want to try it to see if it has the same problem.
vs...@google.com <vs...@google.com> #3
I looked at the logs some more, and here's my understanding so far:
1. There is a pipe stall after the initial SYNC is sent.
2. The recovery code simply clears the stall and attempts a read again. From what I can see, this just receives some data, and it isn't guaranteed to have returned the correct values anyway.
3. The transport quits when it detects invalid data, so at this point it is like have nothing connected.
4. Disconnecting and reconnecting a device will kick off a new transport, and the device is again detected. This happens via AndroidInterfaceNotify calling AndroidDeviceAdded which seems to register_usb_transport.
First, I'm not sure why there would be a pipe stall so early. Maybe it was because of some missing cleanup when exiting the previous time.
Second, we need to figure out how to safely recover from this stall, rather than just giving up on that transport..Or maybe we give up on it, but recreate another transport. I'll see whatever is the easiest one to implement.
Spencer - you mention resetting the pipe at the beginning - could you point me to that CL?
1. There is a pipe stall after the initial SYNC is sent.
2. The recovery code simply clears the stall and attempts a read again. From what I can see, this just receives some data, and it isn't guaranteed to have returned the correct values anyway.
3. The transport quits when it detects invalid data, so at this point it is like have nothing connected.
4. Disconnecting and reconnecting a device will kick off a new transport, and the device is again detected. This happens via AndroidInterfaceNotify calling AndroidDeviceAdded which seems to register_usb_transport.
First, I'm not sure why there would be a pipe stall so early. Maybe it was because of some missing cleanup when exiting the previous time.
Second, we need to figure out how to safely recover from this stall, rather than just giving up on that transport..Or maybe we give up on it, but recreate another transport. I'll see whatever is the easiest one to implement.
Spencer - you mention resetting the pipe at the beginning - could you point me to that CL?
en...@google.com <en...@google.com> #4
(i don't think "pipe stall" means the pipe is full --- i think it means there's an error condition to look at.)
co...@gmail.com <co...@gmail.com> #6
SYNC is not actually sent over the wire (see protocol.txt), so that's why I don't think it has anything to do with USB pipes.
How about this idea: when the USB interface is originally opened, call the API to clear any potential pipe stall. In other words, do this early before usb_read()/usb_write() is even called. Or will that mess up things? (I don't know if the device immediately sends something over the wire when it connects or whether the device starts out 'idle').
My two changes touch Windows-only code to reset the pipe when resuming:https://android-review.googlesource.com/161162 https://android-review.googlesource.com/161114 . I'm not sure if this will be useful to you.
How about this idea: when the USB interface is originally opened, call the API to clear any potential pipe stall. In other words, do this early before usb_read()/usb_write() is even called. Or will that mess up things? (I don't know if the device immediately sends something over the wire when it connects or whether the device starts out 'idle').
My two changes touch Windows-only code to reset the pipe when resuming:
vs...@google.com <vs...@google.com> #7
Here's a total hack that seems to fix the problem for me: https://android-review.googlesource.com/#/c/165821/1/adb/transport.cpp .
I think Spencer's idea of either clearing the pipe stall, or a "reset" of that pipe (if such a thing exists) before we start might fix the problem as well.
I see pros and cons to both solutions (and maybe both are needed?): I think both are good ideas (clearing the slate initially, and adding a bit more of recovery code to handle errors/stalls when they do occur)
enh - How do you want to proceed? Maybe you can do the fix in a better way and I can get it tested within the team?
I think Spencer's idea of either clearing the pipe stall, or a "reset" of that pipe (if such a thing exists) before we start might fix the problem as well.
I see pros and cons to both solutions (and maybe both are needed?): I think both are good ideas (clearing the slate initially, and adding a bit more of recovery code to handle errors/stalls when they do occur)
enh - How do you want to proceed? Maybe you can do the fix in a better way and I can get it tested within the team?
vs...@google.com <vs...@google.com> #8
I spent a few hours installing OS X on VMWare Fusion in the hopes that VMWare can capture the USB Logs. But when running under VMWare, this is super reliable - adb never seems to see a stall.
So that experiment gives very little info - maybe it is the additional delay due to VMWare, or it is the VMWare USB stack, but either way, no issues are seen when running under VMWare.
So that experiment gives very little info - maybe it is the additional delay due to VMWare, or it is the VMWare USB stack, but either way, no issues are seen when running under VMWare.
vs...@google.com <vs...@google.com> #9
Attached are some packet captures using TotalPhase Beagle. Can be read using the DataCenter software from http://www.totalphase.com/products/data-center/
working-trace: shows a trace where everything works normally.
packet-71-ignored-76-invalid: shows a trace where adb drops a transport because of a stall.
The capture actually doesn't show a stall - it is only the OS X Usb stack which seems to report this stall. Because of the reported stall, adb ignores the AUTH packet, and then attempts to read the 20 bytes token that is sent in the next packet (76) and attempts to parse it as a packet, which of course fails.
Next steps:
- Apple provides a logging version of the IOUsbKit, but that is not released as yet for OS X 10.10. This could possibly shed more light into why a stall is reported..
- I still think a valid solution is to either reset the state machine, and start again from the host sending a CNXN, or follow the Linux/Windows approach of attempting to re-discover connected USB devices every second (or after every unexpected drop).
working-trace: shows a trace where everything works normally.
packet-71-ignored-76-invalid: shows a trace where adb drops a transport because of a stall.
The capture actually doesn't show a stall - it is only the OS X Usb stack which seems to report this stall. Because of the reported stall, adb ignores the AUTH packet, and then attempts to read the 20 bytes token that is sent in the next packet (76) and attempts to parse it as a packet, which of course fails.
Next steps:
- Apple provides a logging version of the IOUsbKit, but that is not released as yet for OS X 10.10. This could possibly shed more light into why a stall is reported..
- I still think a valid solution is to either reset the state machine, and start again from the host sending a CNXN, or follow the Linux/Windows approach of attempting to re-discover connected USB devices every second (or after every unexpected drop).
co...@gmail.com <co...@gmail.com> #10
I wonder if Chrome's adb implementation has this problem. Maybe they solved it? It sounds like a pretty general problem that you hit that others must be hitting as well.
ba...@google.com <ba...@google.com> #11
This got pointed to by a post to the Apple USB list. The USB use of "STALL" in the Apple APIs is ambiguous, it can mean that a STALL PID was returned, or that the pipe is already STALLed because of an error on an earlier transaction.
You may need to be looking at what happened on the earlier transactions. I'm not sure I can decode your log, but you may need to be looking further back, which isn't included in the comments.
Given the description, my guess would be a random error caused by bad cabling. Which would most likely show up as a kIOReturnNotResponding error.
You may need to be looking at what happened on the earlier transactions. I'm not sure I can decode your log, but you may need to be looking further back, which isn't included in the comments.
Given the description, my guess would be a random error caused by bad cabling. Which would most likely show up as a kIOReturnNotResponding error.
vs...@google.com <vs...@google.com> #12
Thanks for the response Barry. I've attached a screenshot of the protocol analyzer reading the above capture.
The packet at index 71 is what isn't read by the host (and instead returns the stall error). But on the wire, the data is sent properly. About the only possible error on the wire is the red line at index 5, but that seems like something else to me..
The packet at index 71 is what isn't read by the host (and instead returns the stall error). But on the wire, the data is sent properly. About the only possible error on the wire is the red line at index 5, but that seems like something else to me..
ba...@google.com <ba...@google.com> #13
Are you sure it's transaction 71, not 76 which is the transaction in question?
My guess is that transaction 71 is dropped on the floor due to a data toggle error, and that then transaction 76 is received. Though as this transaction is ACKed its a puzzle where the STALL comes from.
What's the host? Is this an XHCI host? It makes me wonder about the data toggle handling in the XHCI driver. (I wrote the XHCI driver.)
The bus idle time between transaction 71 and 76 is unreasonably short, unless the controller is an XHCI controller and just threw away transaction 71.
The fact that both IN and OUT pipes start with a DATA1 packets is suspicious. A trace starting when the device is plugged in might be more illuminating. That would tell you if there were previous toggle errors which might add up to this. On its own, the excerpt isn't enough.
My guess is that transaction 71 is dropped on the floor due to a data toggle error, and that then transaction 76 is received. Though as this transaction is ACKed its a puzzle where the STALL comes from.
What's the host? Is this an XHCI host? It makes me wonder about the data toggle handling in the XHCI driver. (I wrote the XHCI driver.)
The bus idle time between transaction 71 and 76 is unreasonably short, unless the controller is an XHCI controller and just threw away transaction 71.
The fact that both IN and OUT pipes start with a DATA1 packets is suspicious. A trace starting when the device is plugged in might be more illuminating. That would tell you if there were previous toggle errors which might add up to this. On its own, the excerpt isn't enough.
ba...@google.com <ba...@google.com> #14
To which end it'd be interesting to know what ends up in the buffer of the transfer which receives the error. It might just have the bytes from one packet or the other.
vs...@google.com <vs...@google.com> #15
Thank you again for your help Barry!
> Are you sure it's transaction 71, not 76 which is the transaction in question?
Yes, if you look at the details pane in the screenshot, packet 71 shows that it is an AUTH packet. We miss out on reading this packet.
> What's the host? Is this an XHCI host?
Sorry, I'm a newbie here, I've been looking at this issue for hardly 2 days. This is on a Macbook Pro (Retina, Early 2013). The system report says:
Host Controller Location: Built-in USB
Host Controller Driver: AppleUSBEHCI
> The bus idle time between transaction 71 and 76 is unreasonably short, unless the controller is an XHCI controller and just threw away transaction 71.
The way the Android remote device writes out these 2 packets is like this:
usb_write(packet)
if (packet has data) {
usb_write(packet->data);
}
At a higher level, these 2 USB packets form a single higher level protocol packet and are written out consecutively. I don't know if that explains how the capture looks though.
> A trace starting when the device is plugged in might be more illuminating. That would tell you if there were previous toggle errors which might add up to this. On its own, the excerpt isn't enough.
I'll see if I can get such a capture. The issue we have is not about when the device is plugged in, but happens in between during a sequence of actions.
> To which end it'd be interesting to know what ends up in the buffer of the transfer which receives the error.
What I see in the buffer is:
1. first usb_read, which is supposed to read packet 71 returns a IO Stall error. The buffer hasn't been touched and still has zeroes.
2. After clearing the stall, the subsequent read returns the correct data from packet 76.
> Are you sure it's transaction 71, not 76 which is the transaction in question?
Yes, if you look at the details pane in the screenshot, packet 71 shows that it is an AUTH packet. We miss out on reading this packet.
> What's the host? Is this an XHCI host?
Sorry, I'm a newbie here, I've been looking at this issue for hardly 2 days. This is on a Macbook Pro (Retina, Early 2013). The system report says:
Host Controller Location: Built-in USB
Host Controller Driver: AppleUSBEHCI
> The bus idle time between transaction 71 and 76 is unreasonably short, unless the controller is an XHCI controller and just threw away transaction 71.
The way the Android remote device writes out these 2 packets is like this:
usb_write(packet)
if (packet has data) {
usb_write(packet->data);
}
At a higher level, these 2 USB packets form a single higher level protocol packet and are written out consecutively. I don't know if that explains how the capture looks though.
> A trace starting when the device is plugged in might be more illuminating. That would tell you if there were previous toggle errors which might add up to this. On its own, the excerpt isn't enough.
I'll see if I can get such a capture. The issue we have is not about when the device is plugged in, but happens in between during a sequence of actions.
> To which end it'd be interesting to know what ends up in the buffer of the transfer which receives the error.
What I see in the buffer is:
1. first usb_read, which is supposed to read packet 71 returns a IO Stall error. The buffer hasn't been touched and still has zeroes.
2. After clearing the stall, the subsequent read returns the correct data from packet 76.
vs...@google.com <vs...@google.com> #16
Attached a capture that shows some more correct sequences going on before the issue crops up.
In the attached screenshot:
1939: host is telling the device to close the current session via a CLSE packet
2002+2007: host is now starting a new connection via an CNXN packet
2015+2020: device responds back asking the host to authorize with an AUTH packet
The host doesn't read packet 2015 (sees a pipe stall instead), but reads packet 2020 correctly.
In the attached screenshot:
1939: host is telling the device to close the current session via a CLSE packet
2002+2007: host is now starting a new connection via an CNXN packet
2015+2020: device responds back asking the host to authorize with an AUTH packet
The host doesn't read packet 2015 (sees a pipe stall instead), but reads packet 2020 correctly.
ba...@google.com <ba...@google.com> #17
OK, if I'm thinking of the right machine, that might be able to do it as well. A system profiler would help check I'm thinking of the right machine. Its got an XHCI, but the high speed traffic is partitioned off to the EHCI controller, which makes it a bit faster. By the time of system version you're using there was also an optimisation which made the EHCI faster.
The question isn't what the Android device is doing, but what the Mac host is doing. A Mac host can't complete one transaction and initiate another in 2µs (the bus idle between 75 and 76), it must be purely the controller which is doing this.
I'm pretty sure an EHCI controller can't complete one transaction and initiate another in 2µs, even if both were queued on the controller at once. That is if the host driver were using overlapped async transactions. So that's one question, is the host driver using async transactions?
To get a 2µs gap between 75 and 76, the controller most likely must have thrown away the packet in 74, so it issues the next IN almost immediately. Discarding a packet while also ACKing that packet is a classic sign of a data toggle issue. As the packet in 74 is a DATA1 packet, the suspicion is that the host was expecting a DATA0 instead.
The host and device could be mismatched like this for a couple of reasons, all of which depend on what happened previously. There could have been an error earlier and the host had its STALL cleared, which reset the expected toggle, but no one told the device about this. There is also a problem if the host driver is a user space driver and is closed an opened. The host can have its toggle cleared, and a lot of times no one thinks to tell the device. So another question is this a kext or user space driver?
As you note that "After clearing the stall, the subsequent read returns the correct data from packet 76." There is no time for, or evidence of, on the bus of a STALL being cleared. So the data seen after that probably doesn't come from transaction 76, but a retransmission.
Extending the capture past this time until the retry has happened might also be useful.
The question isn't what the Android device is doing, but what the Mac host is doing. A Mac host can't complete one transaction and initiate another in 2µs (the bus idle between 75 and 76), it must be purely the controller which is doing this.
I'm pretty sure an EHCI controller can't complete one transaction and initiate another in 2µs, even if both were queued on the controller at once. That is if the host driver were using overlapped async transactions. So that's one question, is the host driver using async transactions?
To get a 2µs gap between 75 and 76, the controller most likely must have thrown away the packet in 74, so it issues the next IN almost immediately. Discarding a packet while also ACKing that packet is a classic sign of a data toggle issue. As the packet in 74 is a DATA1 packet, the suspicion is that the host was expecting a DATA0 instead.
The host and device could be mismatched like this for a couple of reasons, all of which depend on what happened previously. There could have been an error earlier and the host had its STALL cleared, which reset the expected toggle, but no one told the device about this. There is also a problem if the host driver is a user space driver and is closed an opened. The host can have its toggle cleared, and a lot of times no one thinks to tell the device. So another question is this a kext or user space driver?
As you note that "After clearing the stall, the subsequent read returns the correct data from packet 76." There is no time for, or evidence of, on the bus of a STALL being cleared. So the data seen after that probably doesn't come from transaction 76, but a retransmission.
Extending the capture past this time until the retry has happened might also be useful.
vs...@google.com <vs...@google.com> #18
Barry followed up offline and he deduced that issuing a ClearStallBothEnds should fix the issue. That fix is here: https://android-review.googlesource.com/#/c/166740/
From his email: "I saw we posted roughly simultaneously, so I only just noticed the trace with the previous transactions. It looks like something clears the host endpoint between the CLSE and CNXN packets, but doesn’t tell the device about this.
That goes back to my question about user-space vs kernel. I know nothing of your system, I’m just diagnosing the bus behavior. If the host driver is user space driver, the close-open could well cause this. Its a common error."
For reference, the discussion on the Apple mailing list is here:http://lists.apple.com/archives/usb/2015/Aug/msg00021.html
From his email: "I saw we posted roughly simultaneously, so I only just noticed the trace with the previous transactions. It looks like something clears the host endpoint between the CLSE and CNXN packets, but doesn’t tell the device about this.
That goes back to my question about user-space vs kernel. I know nothing of your system, I’m just diagnosing the bus behavior. If the host driver is user space driver, the close-open could well cause this. Its a common error."
For reference, the discussion on the Apple mailing list is here:
a....@gmail.com <a....@gmail.com> #19
When is this feature planned to be released?
Description
$ adb start-server
$ adb devices -l
$ adb kill-server
you can see that adb fails to recognize the device at a reasonably high rate. I'm not exactly sure why, but if the same sequence is done via ddmlib (as is done in Studio/ddms), then the frequency with which devices aren't found is even higher.
I've attached 2 traces of the adb server log, one where it worked fine, and one where it didn't. The differences seem to start fairly early:
dump_packet: 58804:5033734 | 021e9eccf0a69ccc: to remote: [SYNC] arg0=1 arg1=2 (len=0)
output_thread: 58804:5033734 | 021e9eccf0a69ccc: data pump started
fdevent_process: 58804:5033719 | select() returned n=1, errno=0
usb_read: 58804:5033734 | Pipe stalled, clearing stall.
fdevent_process: 58804:5033719 | got events fde->fd=8 events=0001, state=0101
When it works, the above sequence doesn't happen, instead we have:
send_connect: 58788:5033174 | Calling send_connect
dump_packet: 58788:5033188 | 021e9eccf0a69ccc: from remote: [SYNC] arg0=1 arg1=2 (len=0) dump_packet: 58788:5033174 | 021e9eccf0a69ccc: to remote: [CNXN] arg0=0x1000000 arg1=0x40000 (len=7)
686f73743a3a00 host::.
input_thread: 58788:5033188 | 021e9eccf0a69ccc: transport SYNC online
fdevent_loop: 58788:5033174 | --- ---- waiting for events
dump_packet: 58788:5033188 | 021e9eccf0a69ccc: from remote: [CNXN] arg0=0x1000000 arg1=0x40000 (len=7) 686f73743a3a00 host::. input_thread: 58788:5033188 | 021e9eccf0a69ccc: transport got packet, sending to remote
output_thread: 58788:5033189 | 021e9eccf0a69ccc: received remote packet, sending to transport
When it doesn't work, we clear the pipe stall, but it seems like there might be a race condition between SYNC vs CNXN:
send_connect: 58804:5033719 | Calling send_connect
dump_packet: 58804:5033719 | 021e9eccf0a69ccc: to remote: [CNXN] arg0=0x1000000 arg1=0x40000 (len=7) dump_packet: 58804:5033733 | 02
1e9eccf0a69ccc: from remote: [SYNC] arg0=1 arg1=2 (len=0) 686f73743a3a00 host::.
fdevent_loop: 58804:5033719 | --- ---- waiting for events
input_thread: 58804:5033733 | 021e9eccf0a69ccc: transport SYNC online
dump_packet: 58804:5033733 | 021e9eccf0a69ccc: from remote: [CNXN] arg0=0x1000000 arg1=0x40000 (len=7) 686f73743a3a00 host::.
input_thread: 58804:5033733 | 021e9eccf0a69ccc: transport got packet, sending to remote
check_header: 58804:5033734 | check_header(): invalid magic
remote_read: 58804:5033734 | remote usb: check_header failed