Fixed
Status Update
Comments
be...@google.com <be...@google.com>
[Deleted User] <[Deleted User]> #2
Can you reproduce this in the dev_appserver?
(If so, a program like http://guppy-pe.sourceforge.net/#Heapy may help narrow down the problem.)
se...@google.com <se...@google.com> #3
Even if you can reproduce this on dev_appserver, Heapy requires a C extension and has to run in-process, so I'm not sure it can run in dev_appserver without serious tweaks.
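For reference, basic Heapy usage in a local process looks roughly like this (a sketch; where you take the snapshot is up to you):

# Requires the guppy package and its C extension, so this only works
# where native extensions can load (not in the sandboxed runtime).
from guppy import hpy

hp = hpy()
# ... run the suspect NDB code here ...
print hp.heap()  # dump what is still reachable, grouped by type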
[Deleted User] <[Deleted User]> #4
[Comment deleted]
se...@google.com <se...@google.com> #5
[Comment deleted]
se...@google.com <se...@google.com> #6
Do you have any results to share from heapy?
[Deleted User] <[Deleted User]> #7
I've reproduced this issue with testbed [https://developers.google.com/appengine/docs/python/tools/localunittesting] and did some investigation and memory profiling.
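Roughly, the local setup looked like this (a sketch only; the Dummy model, sizes, and counts are illustrative, not the actual test code):

from google.appengine.ext import ndb, testbed


class Dummy(ndb.Model):  # illustrative model
    payload = ndb.TextProperty()


tb = testbed.Testbed()
tb.activate()
tb.init_datastore_v3_stub()
tb.init_memcache_stub()

# Disable the in-app (in-context) cache, since the leak only shows up
# when it is disabled or cleared.
ndb.get_context().set_cache_policy(False)

ndb.put_multi([Dummy(payload='x' * 1000) for _ in range(500)])

# Page through the results the way the reported handler does.
entities, cursor, more = Dummy.query().fetch_page(10)
while more:
    entities, cursor, more = Dummy.query().fetch_page(10, start_cursor=cursor)

tb.deactivate()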
There's a race condition in the ndb stack which prevents a percentage of futures from being marked complete. As a result, they are not removed from the state object, and this keeps the entity batch reachable (via a callback closure). Yikes!
I'll post an update when I have more information about the root cause and [hopefully!] a workaround. Note that this leak only affects requests when the in-app cache is disabled or cleared.
Regards,
Isaac
se...@google.com <se...@google.com> #8
Some progress:
The non-determinism observed with fetch_page is due to the iteration order of eventloop.rpcs, which is passed to datastore_rpc.MultiRpc.wait_any(); apiproxy_stub_map.__check_one selects the *last* RPC from the iterator.
Fetching with a page_size of 10 issues an RPC with count=10 and limit=11, a standard technique to force the backend to more accurately determine whether there are more results. This returns 10 results, but due to a bug in the way the QueryIterator is unraveled, an RPC is added to fetch the last entry (using the obtained cursor and count=1). NDB then returns the batch of entities without processing this RPC. I believe this RPC will not be evaluated until selected at random (if MultiRpc consumes it before a necessary RPC), since it doesn't block client code.
For better or worse, direct use of iter() does not have this issue (count and limit will be the same). iter() can be used as a workaround for the performance and memory issues associated with fetch_page caused by the above.
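For illustration, the workaround amounts to replacing fetch_page() pagination with direct iteration (a sketch; MyModel and the sizes are placeholders):

from google.appengine.ext import ndb


class MyModel(ndb.Model):  # placeholder model
    pass


# fetch_page() issues the extra count=1 RPC described above on each page.
entities, cursor, more = MyModel.query().fetch_page(10)

# Workaround: iterate directly; count and limit stay equal and no stray
# RPC is left behind in the event loop.
for entity in MyModel.query().iter(batch_size=100):
    pass  # process the entity here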
Joachim also reports memory increases over an iter() call, but I have not been able to reproduce this locally. I don't see what prevents no-longer-needed futures and RPCs from building up, though, so this could explain his observations. We may be able to fix this by replacing some structures with weakrefs, starting with ndb.tasklets._state.all_pending and ndb.eventloop.rpcs.
https://gist.github.com/isaacl/9226812 has some of my local test setup; you may find it useful for testing ndb features or benchmarking database functions.
Please speak up if you're able to come up with a reproducible leak with query().iter().
Cheers,
Isaac
do...@gmail.com <do...@gmail.com> #9
Isaac, many thanks for your investigation and promising results!
- Are you able to reproduce the problem with query().iter() in production? This is the only way I have tested so far.
- Do you have an idea why the value of the batch_size parameter seems to play a role?
- Is there a way to reset the state object completely as a temporary workaround (to allow the objects to be collected)?
lo...@znerol.ch <lo...@znerol.ch> #10
Joachim, we at Khan Academy are very interested in this bug, since we suspect it may be affecting us as well.
I could reproduce the memory growth locally (on dev-appserver) with the /task/page handler, but when I tried to tweak your example to remove the task queue -- to do the ndb work in get() rather than post() -- it did not use excess memory. Do you know why the task queue is necessary for the memory use bug to emerge?
lo...@znerol.ch <lo...@znerol.ch> #11
Running in a task or backend is necessary for longer-running processes that would exceed the timeout limit of a normal request.
I am seeing memory growth regardless of whether I call iter() or fetch_page().
Any progress towards a resolution?
lo...@znerol.ch <lo...@znerol.ch> #12
I spent a long time looking into this, using a hacked version of python that I could do memory profiling on. Caveats: this was only on dev-appserver, so I only looked at /task/page, since /task/memory didn't have this same growth.
My conclusion is that everything is working normally, and while the memory grows to 300M+ while executing /task/page and then stays there, that is because of the way memory allocation works.
Basically, the /task/page job requires lots of memory during the request in order to do its job (it's fetching a lot of data; maybe not 300M worth of data, but the extra needed memory could be due to the bug that Isaac found). When the request is over, Python is properly free()ing the memory, but the malloc implementation doesn't return that freed memory back to the OS; it keeps it on its own internal freelists. (All malloc implementations work this way, for efficiency.)
Thus, Python is 'using' a lot of memory but not leaking it. When I run the /task/page job again, it stays at 300M the entire time, showing there's not a leak, at least for me.
The prod environment may well be different, so my analysis is not useful for closing this bug one way or the other, but I am confident that, at least in the dev-appserver implementation, there's no memory leak.
no...@gmail.com <no...@gmail.com> #13
[I meant /task/basic above, not /task/memory]
One thing to point out: we had to put the entire task inside @ndb.toplevel to avoid memory leaks. Otherwise, rogue RPCs might stick around forever, which *would* be a memory leak.
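For anyone following along, the change amounts to something like this (a sketch; the handler name and body are hypothetical):

from google.appengine.ext import ndb
import webapp2


def run_paging_job():
    pass  # placeholder for the actual task body


class PageTaskHandler(webapp2.RequestHandler):  # hypothetical handler
    @ndb.toplevel
    def post(self):
        # @ndb.toplevel waits for all outstanding async NDB RPCs before
        # the response is returned, so no rogue RPCs outlive the request.
        run_paging_job()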
[Deleted User] <[Deleted User]> #14
It looks like this bug has gone stale, so I just wanted to pipe in that this is definitely a problem in production and @ndb.toplevel doesn't help. I ended up with a workaround whereby I dynamically create a db.Expando class from the ndb model to do a keys-only equivalent query. I convert the old db keys to ndb keys, load the entities, and do the appropriate work. This resulted in all of our memory errors going away.
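In outline, the workaround looks something like this (a sketch with a placeholder MyModel; not the exact code I use):

from google.appengine.ext import db, ndb


class MyModel(ndb.Model):  # placeholder ndb model
    name = ndb.StringProperty()


# Build a db.Expando peer with the same kind name, used only for a
# keys-only query through the old db API.
MyModelPeer = type('MyModel', (db.Expando,), {})

for old_key in db.Query(MyModelPeer, keys_only=True).run(batch_size=500):
    entity = ndb.Key.from_old_key(old_key).get()
    # ... do the appropriate work on the ndb entity ...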
ro...@gmail.com <ro...@gmail.com> #15
That sounds like an excellent workaround!
Would you mind sharing some of the code as an example?
sa...@gmail.com <sa...@gmail.com> #16
[Comment deleted]
il...@google.com <il...@google.com>
il...@google.com <il...@google.com> #17
It's not a full example but I attached some code with the relevant pieces. I think the code is pretty self explanatory but I'd be happy to elaborate if needed. Hope that helps!
mi...@gmail.com <mi...@gmail.com> #18
I also use @ndb.toplevel in my module's get handler, where I upload all my entities in one go using ndb.put_multi(). I have seen my modules being shut down automatically because the memory consumed is sometimes over 150-200 MB, although the total size of all the entities in the put_multi is less than 5-10 MB. I have been forced to use machines with more memory, but for the past two days, even with bigger machines, the memory used by the same modules with almost the same dataset has spiked to over 600 MB, and _ah/stop starts killing all my module instances. Is there a possibility that the memory balloons while the background process which uploads entities stalls?
Is there a workaround until this problem gets fixed?
Error messages:
F 11:06:32.784 Exceeded soft private memory limit with 614.477 MB after servicing 2 requests total
W 11:06:32.784 While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
il...@google.com <il...@google.com> #19
I use a script to upload/update data in the datastore: I query the datastore for a certain product Id, then do some updates to the properties and write it back using a put_multi inside a deferred.defer.
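For context, the flow is roughly the following (a sketch; the Product model and property names are placeholders, not my actual code):

from google.appengine.ext import deferred, ndb


class Product(ndb.Model):  # placeholder model
    product_id = ndb.StringProperty()
    price = ndb.FloatProperty()


def update_product(product_id, new_price):
    # Query, mutate, and write back the matching entities in one batch.
    entities = Product.query(Product.product_id == product_id).fetch()
    for entity in entities:
        entity.price = new_price
    ndb.put_multi(entities)


# Run the update in the task queue instead of the user-facing request.
deferred.defer(update_product, 'some-product-id', 9.99)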
To avoid the issue of the memory blowing up and the module quitting, I used B4_1G. Now I get "Process moved to a different machine" and the module just quits halfway through the update process. I can see that the utilization and memory are within limits, but the module still quits.
I opened another issue, 11167; I strongly suspect these issues are related.
App Id: Shopina-Voyage-001
e....@gmail.com <e....@gmail.com> #20
One additional note: the problem "Process moved to a different machine" occurs with both NDB and DB models. I checked with DB, and after updating around 22K entities (in ~90 minutes), it fails and the module just quits unexpectedly.
il...@google.com <il...@google.com> #21
You've got a different problem, I'm sure.
ma...@mcassiano.com <ma...@mcassiano.com> #22
lb...@gmail.com <lb...@gmail.com> #23
My symptom of this bug was the following message coming from appstats:
"Found 1 RPC request(s) without matching response (presumably due to timeouts or other errors)"
It took me a while to find this bug based on that, so I'm adding it here so it's searchable.
"Found 1 RPC request(s) without matching response (presumably due to timeouts or other errors)"
It took me a while to find this bug based on that, so I'm adding it here so it's searchable.
ma...@mcassiano.com <ma...@mcassiano.com> #24
We run into this problem every 2 weeks or so. We've started making db peer classes for long-running tasks (like Map/Reduce data migrations, for example). It's a crap workaround that results in duplication of code and effort. We've even thought about abandoning ndb altogether and just using db, but the caching benefits of ndb have kept us using this workaround. Does anyone at Google care about this issue anymore? Perhaps Google doesn't even use ndb? It's been sitting here for well over a year with 80 stars on it now and no news.
ta...@gmail.com <ta...@gmail.com> #25
Nick, if you look at the code I attached in fix.py, you'll see how you can create the peer class automatically. That helps a little bit with the code bloat/maintenance.
It seems that this should be higher than 'Medium' priority. How can we get that bumped?
ma...@mcassiano.com <ma...@mcassiano.com> #26
Curt, fix.py works if you only need to get at the key of the object. I couldn't get it to work and be able to get values of other properties.
mo...@gmail.com <mo...@gmail.com> #27
Kris, well, there isn't a 1-1 mapping between 'ndb' and 'db' properties. So yes, that is problematic. But! The trick is to do a 'db' keys-only query, convert the 'db' keys to 'ndb' keys, and then finally 'get' or 'get_multi' on the 'ndb' key(s) to get the ndb entities. And then you should be set.
cc...@google.com <cc...@google.com> #28
Thanks, Curt, that's what I ended up doing in my mapreduce.
ra...@gmail.com <ra...@gmail.com> #29
I'm seeing this in requests (wrapped with ndb.toplevel) where if ndb.Query.iter() is invoked *at all*, even with a keys-only query, the request causes a memory fault prior to fully completing. It seems even the creation of the generator is enough to cause the fault, even if the generator never actually produces any results (returning immediately after calling ndb.Query.iter() still produces the fault).
I'll try the db-key conversion to see if that helps.
il...@google.com <il...@google.com> #30
The `ndb` to `db` conversion seems to do the trick for keys-only queries. I'm also finding I have to manually track and flush async requests (even within the bounds of an @ndb.toplevel decorator) to avoid memory faults.
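Concretely, the manual tracking looks roughly like this (a sketch; MyModel is a placeholder):

from google.appengine.ext import ndb


class MyModel(ndb.Model):  # placeholder model
    value = ndb.IntegerProperty()


# Track async RPCs explicitly and flush them, instead of relying only on
# @ndb.toplevel to clean up when the request ends.
futures = [MyModel(value=i).put_async() for i in range(10)]
ndb.Future.wait_all(futures)  # block until every tracked RPC is done
ndb.get_context().clear_cache()  # also drop the in-context cache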
lb...@gmail.com <lb...@gmail.com> #31
Ran into this issue, and created a support case for it. This was their response:
"At present, our Team is still working on a fix for this issue. Thought, we still don't have an ETA on resolution but we will update the external issue as soon a fix is available. I advice starring the issue to get any updates moving forward and also keep measuring the developers impact."
Really? It's been a year and 6 months!
"At present, our Team is still working on a fix for this issue. Thought, we still don't have an ETA on resolution but we will update the external issue as soon a fix is available. I advice starring the issue to get any updates moving forward and also keep measuring the developers impact."
Really? It's been a year and 6 months!
mi...@gmail.com <mi...@gmail.com> #32
Just wanted to share something that has seemed to help us a lot. We turned in-context caching off for all our NDB models, and turn it on (explicitly) for lighter (smaller) models that get called frequently like the User: https://cloud.google.com/appengine/docs/python/ndb/cache#policy_functions
Not sure what version of the SDK this was introduced in, but we just define _use_cache = False within the model class and we're seeing this issue far less frequently (1% of the time vs 20%+).
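For reference, the per-model policy looks like this (a sketch; the model names are illustrative):

from google.appengine.ext import ndb


class LargeThing(ndb.Model):
    _use_cache = False  # skip the in-context cache for this kind
    payload = ndb.TextProperty()


class User(ndb.Model):
    _use_cache = True  # explicitly keep caching for small, hot models
    name = ndb.StringProperty()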
ra...@gmail.com <ra...@gmail.com> #33
Eric, that's the exact same response I got when I opened a support ticket A YEAR AGO.
We still run into this on a regular basis. Our code is now full of kludges to work around this issue.
It appears that ilevy@google.com located the source of the problem last February but nothing has been done to resolve it since then.
il...@google.com <il...@google.com> #34
So a LinkedIn search shows that Isaac Levy left Google last August! Eric, is your support ticket still open? Can you ask them to assign this issue to somebody who actually still works for Google?
ra...@gmail.com <ra...@gmail.com> #35
Kris, I will! Can you share some of the workarounds/kludges that worked for you? I understand there is no silver bullet, but we're seeing this in mapreduce and we're down to only allowing one task to run at a time on a module instance. Not good.
I'll talk with support and see what the hell is going on.
lb...@gmail.com <lb...@gmail.com> #36
Eric, I created peer classes for my entities that extend db.Model. So if I have a MyEntity(ndb.Model) class defined in models.py, I create a MyEntity(db.Model) in models_db.py. You don't need to define any properties on the db model, because you're only going to use the key from it. Specify this class as the entity_kind in your mapreduce input_reader. An instance of the db entity will get passed to the mapper function. You can then take the key from that object, convert it to an ndb.Key and do a get() to get the ndb object.
Here's a barebones example:
output = yield mapreduce_pipeline.MapperPipeline(
    "My Mapper",
    "mapper_function",
    "mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.FileOutputWriter",
    params={
        "input_reader": {
            "entity_kind": 'models_db.MyEntity',
            'filters': filters,
            'batch_size': 500,
        },
        "output_writer": {
            "filesystem": "gs",
            "gs_bucket_name": MAPREDUCE_BUCKET,
            "output_sharding": "none",
        },
    },
    shards=16
)

models.py:

class MyEntity(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)
    name = ndb.StringProperty(required=True)
    # other properties

models_db.py:

class MyEntity(db.Model):
    pass

def mapper_function(db_entity):
    entity = ndb.Key.from_old_key(db_entity.key()).get()
    # do stuff
    # yield something
Hope this helps!
Description
Version used: 1.0.0-rc1
Devices/Android versions reproduced on: Nexus 5x, API 27
Sample project:
Issue on StackOverflow:
CASE-1: Log output for locking and unlocking in landscape mode
02-20 16:30:14.159 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onCreate: orientation 2
02-20 16:30:14.159 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onCreate: no savedInstanceState
02-20 16:30:14.169 8296-8296/com.dkarmazi.viewmodelscoping D/MyViewModel: MyViewModel 55090662: created
02-20 16:30:14.183 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onResume: orientation 2
### LOCKED IN LANDSCAPE MODE
02-20 16:30:22.978 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onSaveInstanceState: SAVED_STATE_244798673
02-20 16:30:22.996 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onStop: orientation 2
### UNLOCKED IN LANDSCAPE MODE
02-20 16:30:33.177 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onStop: orientation 2
02-20 16:30:33.178 8296-8296/com.dkarmazi.viewmodelscoping D/MyViewModel: MyViewModel 55090662: onCleared
02-20 16:30:33.179 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onDestroy: orientation 2
02-20 16:30:33.241 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 218111434, onCreate: orientation 1
02-20 16:30:33.241 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 218111434, onCreate: saved string from savedInstanceState SAVED_STATE_244798673
02-20 16:30:33.242 8296-8296/com.dkarmazi.viewmodelscoping D/MyViewModel: MyViewModel 113479034: created
02-20 16:30:33.248 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 218111434, onResume: orientation 1
02-20 16:30:33.705 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 218111434, onSaveInstanceState: SAVED_STATE_218111434
02-20 16:30:33.710 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 218111434, onStop: orientation 1
02-20 16:30:33.712 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 218111434, onDestroy: orientation 1
02-20 16:30:33.815 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 158140230, onCreate: orientation 2
02-20 16:30:33.815 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 158140230, onCreate: saved string from savedInstanceState SAVED_STATE_218111434
02-20 16:30:33.822 8296-8296/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 158140230, onResume: orientation 2
CASE-2: Log output for portrait mode:
02-20 16:38:10.283 8567-8567/? D/FragmentActivity: Activity 244798673, onCreate: orientation 1
02-20 16:38:10.283 8567-8567/? D/FragmentActivity: Activity 244798673, onCreate: no savedInstanceState
02-20 16:38:10.293 8567-8567/? D/MyViewModel: MyViewModel 55090662: created
02-20 16:38:10.301 8567-8567/? D/FragmentActivity: Activity 244798673, onResume: orientation 1
02-20 16:38:13.459 8567-8567/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onSaveInstanceState: SAVED_STATE_244798673
02-20 16:38:13.480 8567-8567/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onStop: orientation 1
02-20 16:38:17.704 8567-8567/com.dkarmazi.viewmodelscoping D/FragmentActivity: Activity 244798673, onResume: orientation 1
Problem:
In CASE-1, based on the logs, the bundle gets properly routed from Activity 244798673 to Activity 218111434 at `02-20 16:30:33.241`, but the ViewModel does not persist through this sequence of actions. Also, notice how the ViewModel gets destroyed in CASE-1 but persists in CASE-2.
Expected behavior:
`ViewModel` should not be destroyed in CASE-1.