Discussion of deletion and site archiving

Didn’t feel like following directions, eh?

EDIT: this has now become a discussion about deletion and site policy.

2 Likes

Nope. It was too stressful to not know.

Actually, speaking of stress, this helps me relax when programming is stressing me out so I figured I should share: Nice Ocean Waves - YouTube

2 Likes

Wow! Those waves were relaxing.

2 Likes

I know, right? Good stuff.

Is it possible to delete test threads?

Yes, it looks to me like the author can delete the topic by deleting the first post. In general, though, I'm inclined to treat it like a mailing list; the past is the past, and there's no need to delete everything that goes by. Let me know if I'm missing a compelling reason to delete this topic.

2 Likes

More junk to sort through when using the search function. Deleting is a mitigation not available in mailing lists.

3 Likes

Personally, I don't treat Discourse as just a mailing list, for several reasons:

  • It's easier to use Markdown, e.g. for code snippets, because you have a preview.
  • You can edit a post, which I sometimes do. (You don't get mails for edits via the list interface, I guess.)
  • The web interface looks quite nice. :slight_smile:

So from my point of view, what Stephen says about the effect on search results makes sense.

As for the deletion itself, the topic now contains some interesting discussion, so maybe it should stay because of that. :slight_smile: On the other hand, if a topic is just for testing purposes and doesn't contain useful information, I think it makes sense to delete it once it's no longer needed.

1 Like

By the way, a compromise to get the best of both worlds - mailing list and web interface - could be to receive new messages via email and, when you want to reply, follow the "view item" link in the mail to reply in the web interface. That way you see the code markup and any edits that happened since you received the mail.

2 Likes

It's good to know that can be done. A number of years ago, I was part of a discussion group (a social organization, not a technical one) where we discovered that one of the members, who had administrative access, had been deleting old posts from the archive as a sort of vigilante historical censorship.

We discovered this while investigating a completely different matter: there were suspicions that one of the other user accounts had been taken over by someone wanting to scam the group. I never found proof, but it looks like they went through the membership list, which shows latest activity as well as email addresses, and found an account that had not been active in a few years and was registered with a Hotmail address. They then went to Hotmail, found that the email address was available again, and registered it. Then they went back to the forum and requested a password reset using the Hotmail address, gaining access to the old forum account.

However, it didn't take long before someone had the reaction "That doesn't sound like so-and-so." This particular hacker got the technical parts right but then fumbled the social engineering part. The next logical thing to do was to compare the user's recent posts with those from years ago. That's when we discovered the deleted posts.

You see, this forum had a mailing list mode, and several of us, myself included, had complete archives locally. Of course it's easier to search local archives than the web site, so that's what I did, and I ended up quoting a post which had been deleted from the server. After that, we discovered many more deleted posts. They were a mixed bag. There was spam, clearly abusive content, stuff that was impolite but contained constructive criticism, stuff which was opinionated but should have been fine, and other posts where I have no idea what the objection might have been. In any case, we would have been much worse off if the forum had been entirely cloud-based, without some users having their own archive. Worrying about this sort of incident seems paranoid until it actually happens to you.

James

1 Like

No, those are good points. Discourse has a backup mechanism, and I've created one backup just as a test, but we should probably have a rolling backup schedule. Of course, if you're really concerned about post deletion, you need to either keep all the backups or perform automated comparison to see how much got deleted, which would be an entertaining devops kind of operation that I ... shouldn't allow myself to get sucked into. However, if someone else wants to write code to extract the relevant information from the site backups, that would be pretty cool.
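The comparison step could be sketched roughly as follows. This is only a minimal illustration in Python: it assumes the posts from two backups have already been extracted into dictionaries mapping post ID to text, and that extraction step (which depends on the real backup format) is not shown. The function name `find_deleted_posts` and the sample data are made up for the example.

```python
def find_deleted_posts(older, newer):
    """Return posts present in the older snapshot but missing from the newer one.

    Both arguments are hypothetical {post_id: text} dictionaries that would
    have to be extracted from two site backups beforehand.
    """
    missing_ids = older.keys() - newer.keys()
    return {post_id: older[post_id] for post_id in sorted(missing_ids)}


# Made-up sample data standing in for two extracted backups.
older_snapshot = {101: "First post", 102: "Reply about waves", 103: "Spam link"}
newer_snapshot = {101: "First post", 102: "Reply about waves", 104: "New reply"}

deleted = find_deleted_posts(older_snapshot, newer_snapshot)
for post_id, text in deleted.items():
    print(post_id, text)
```

Note that this only reports that a post has gone missing between two snapshots; it says nothing about who removed it or why.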

3 Likes

Sounds like that would allow us to detect that a deletion has happened, but not why or by whom, so I think it would only be useful as a warning system for unusual activity, with some threshold or manual review.

I think rolling backups sound like a good idea, but detecting deletions is probably easier by setting up a webhook in Discourse and logging / monitoring those events with some kind of script / bot.

Sounds like an excellent idea!
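A rough sketch of what the logging side of such a webhook could look like, in Python. It is not a complete receiver: it only shows the function a web handler would call with the request headers and raw body. The header names, the `sha256=` HMAC signature format, the event names `post_destroyed` / `topic_destroyed` and the payload shape are assumptions about how Discourse webhooks are delivered and should be checked against the Discourse documentation before relying on them.

```python
import hashlib
import hmac
import json

# Assumed names of the deletion events; verify against the Discourse docs.
DELETION_EVENTS = {"post_destroyed", "topic_destroyed"}


def signature_is_valid(secret, body, signature_header):
    """Check an HMAC-SHA256 signature of the raw body (assumed 'sha256=<hex>' format)."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def deletion_log_entry(secret, headers, body):
    """Return a log line for a correctly signed deletion event, otherwise None."""
    signature = headers.get("X-Discourse-Event-Signature", "")
    if not signature_is_valid(secret, body, signature):
        return None
    event = headers.get("X-Discourse-Event", "")
    if event not in DELETION_EVENTS:
        return None
    payload = json.loads(body)
    # Assumed payload shape: the record sits under a "post" or "topic" key.
    record = payload.get("post") or payload.get("topic") or {}
    return f"{event} id={record.get('id')}"


# Simulated delivery with a made-up secret and payload.
secret = "example-secret"
body = json.dumps({"post": {"id": 103, "topic_id": 7}}).encode()
headers = {
    "X-Discourse-Event": "post_destroyed",
    "X-Discourse-Event-Signature": "sha256="
    + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest(),
}
print(deletion_log_entry(secret, headers, body))
```

Checking the signature before logging matters here, because otherwise anyone who knows the receiver's URL could fill the log with fake deletion events.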

Are the backups easily accessible? I wouldn't want to commit to it right now, but I'd be interested in taking a look.

Looks like we already have rolling backups and on demand exports.

You are free to download a complete Discourse export and migrate away from our free hosting at any time.

I’ve never worked with infrastructure, so I don’t know whether the current setup is sufficient or appropriate.

@simonls & @dstorrs, does this look OK to you, or should we be doing more?

Best wishes

Stephen

On the topic of detecting deletions:

  • If it is your own post, I think you should be able to delete it.

  • If you have a system administrator who is a ‘bad actor’, I’d suggest any defence you implement can likely be subverted, precisely because they are the system administrator.

Since a large number of users also archive via mailing list mode, I’m not sure there is much motivation for deleting.

Reasons why you would delete: spam, libellous or illegal material.

As I said in the other reply, I’m not an infrastructure expert, so I would welcome learning more.

(And detecting deletions in large corpora is an interesting problem, with specific uses in large ‘Electronic Discovery’ cases e.g. the Enron dataset)

detecting deletions

I don't have a strong opinion on this; my post was mostly to point out an easier way (using webhooks) if somebody really wants to implement a kind of logging system that detects deletions.


The following is my view of backups, but my practical experience is very limited:

backups

Regarding backups, I think it wouldn't hurt (if done correctly, see below) for someone to download one of those backups at regular intervals, in case there is ever a problem with this instance hosted by Discourse, but I am not sure how necessary that actually is.
It is quite likely that the backups handled by Discourse are already handled better than most of us could easily manage. So I would see downloading those backups and keeping them in yet another place as an extra that borders on being paranoid (that said, I guess when securing data it is difficult to be too paranoid :wink: ).

downloading vs testing

Another point is the difference between downloading the backups and actually testing whether a working Discourse instance can be restored from one of them.

backup data is admin level access

I am not sure what is included in the backup, but since a backup has to be able to restore user accounts so that those users can log in again, we shouldn't freely hand out backup data to "curious racketeers who want to write a script".

Basically, I think that finding a volunteer and then having them make extra backups of the backups may actually decrease security, even if that volunteer has good intentions (maybe they make a mistake).
Since even the salted hashes that are probably in the user part of the backup could be brute-forced with enough time, the number of people who have access should stay limited.

(If user accounts are somehow not part of that backup, that may be different.)
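To illustrate why salted hashes are still sensitive, here is a small Python sketch of offline guessing. It uses PBKDF2 from the standard library purely as an example of a salted, iterated hash; whether the backup stores passwords in exactly this scheme is an assumption. The password, the word list and the deliberately low iteration count are all made up for the demo.

```python
import hashlib
import os

ITERATIONS = 1_000  # deliberately low so the demo runs quickly


def hash_password(password, salt):
    """Salted, iterated hash (PBKDF2-HMAC-SHA256), standing in for the real scheme."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


def guess_password(stored_hash, salt, candidates):
    """Try each candidate against the stored hash; return the match or None."""
    for candidate in candidates:
        if hash_password(candidate, salt) == stored_hash:
            return candidate
    return None


# What someone holding a leaked user table would have: the salt and the hash.
salt = os.urandom(16)
stored = hash_password("hunter2", salt)

wordlist = ["password", "123456", "hunter2", "letmein"]
print(guess_password(stored, salt, wordlist))
```

The salt prevents reuse of precomputed tables and the iterations make each guess slower, but neither stops someone with a copy of the data from trying guesses offline for as long as they like, which is why weak passwords remain exposed.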

other points of failure

Looked at the other way, if we don't have downloads of the backups, one worst case would be an admin account getting compromised and used to delete all the backups and then empty the current Discourse instance.
So, strictly speaking, Discourse would be a single point of failure; if backups also exist somewhere else, the instance could simply be set up again.

summary of my view:

  • downloads of backups could be a good thing, if access to them is extremely difficult for an attacker, e.g. encrypted and preferably even offline/cold (not connected to the internet).
  • if that can't reasonably be done by the current admins, it may not be worth it
  • if a person is "added" to do this, this person isn't simply a volunteer, but also an admin (because backups are sensitive data)

Overall, I am not experienced with handling and backing up large amounts of data, and I have never worked as a sysadmin, so there may be other people more qualified to speak from experience.

3 Likes

Great post, thank you! :slight_smile:

Thanks in particular for the notion that people having access to full backups kind of have admin rights. I think the ability to make and read backups should be limited to people who are also trusted to maintain the Racket Discourse instance as admins.

Yes, I think you've hit the nail on the head. I think this is a space where the most sensible thing is to trust that Discourse is handling things correctly. Further, one of the authors of Discourse (Jeff Atwood) asserts bluntly that in Discourse it should not be possible to permanently delete data in a way that does not appear in internal tables, so unwanted/surreptitious deletion should not be possible in most plausible scenarios.