User talk:Jakob.scholbach/Archives/2007/June


Citation database

Nice to see this coming along. Here are some thoughts.

Many (most?) of us do not use IE, so interface elements based on that are useless. Firefox is quite popular, and Safari will be found among Mac users. Design to W3C web standards, not MS standards, then work around the bugs in IE.
Whatever you are doing with your character set, try something different, preferably using UTF-8. The source for each page should state that it is using UTF-8, and then you should be sure to actually do so. Your current method produces replacement characters ("�") where I expect to see "é" and the like.
The template I currently use is "{{citation}}", not "{{cite book}}", and it needs more fields filled out. In particular, it separates given name and family name for all authors. For some kinds of names it is not obvious how the name should be split, so the database needs to store (and provide) the split form.
Quality control will be vital. How will entries be added, altered, and validated?
Many important and recurring references will not be books. For example, seminal papers usually appear as journal articles. Using the {{citation}} template no distinction is necessary; just populate the relevant fields.
On the first of January, 2007, ISBNs switched over to ISBN-13 format. Old ISBN-10 values should have the checksum tested and then be automatically converted to the new format, with correct hyphenation. See here for some tools, and talk to Rich Farmbrough for Wikipedia-specific advice.

That should be enough to keep you busy! Keep up the good work. --KSmrq^T 05:47, 18 June 2007 (UTC)

Thanks for the quick response.

ad 1: I use Firefox, too. The only browser-dependent difference in the DB's functionality is that in IE you can copy to the clipboard by clicking a button, whereas in Mozilla the user needs to do this manually (which is also possible). I will include the reference text to be copied in the list of references, not only in the detailed view. In general I strive to make it as client-independent as possible, but without giving up the advantages of some browsers (where IE is, as a matter of fact, usually advantageous).

ad 2: I'm not sure I understand what you mean. I don't do anything particular with the characters, and in my browser I do get the correct display of French accents (as in Séminaire de ...), using the browser's UTF-8 character encoding. So what could be the reason for the bad display you get?

ad 3: OK, I didn't know that one. This makes everything much easier, though. I'll implement it and then drop the book template. First and last names are separated in the DB (for aesthetic reasons they are not shown separately in the broad list, only in the detailed view of an author). They also appear separately in the citation template to be copied to WP articles.

ad 4: This is really the sore spot, I guess. For now, everyone can add items and edit fields. On the one hand, setting up a user registration system requires some additional work. On the other hand, I wonder a) whether people would vandalize something like this and b) whether people would be willing to validate database entries. Do you have any ideas on that issue? One could try to keep a flagged version of an entry and have some reliable person check whenever such an item is edited. But if the DB grows (which is what I hope), this is going to be tedious and not very rewarding work. As an aside, I personally think this DB should in the long run be a WikiProject. Possibly I'm biased and overemphasize its impact, but it seems to be somewhat core information to have, right? I don't know of any freely accessible universal database of books and papers. (Do you?)

ad 5: Yes, this is the same as 3. It'll be done shortly.

ad 6: At the moment, the user is free to input an ISBN-13. I asked for a routine to give the correct hyphenation of an ISBN and, according to the guys there, there is no real algorithm unless you know the codes of all publishers, which I don't. But if you know of such an algorithm, I'll be happy to implement it. As a small remedy, there is a link to a page where one can get the correct hyphenation, in case somebody is as correct as you and wants to hyphenate it properly.

Today I also saw that MathSciNet provides quite handy (i.e. easy to get and easy to parse) BibTeX references to papers. I'll write a little parser for this, too, so that one can populate the DB not only with information from WP articles, but also from there. Do you know any other sources of information of this kind? Also, where do you usually look to find the URLs of online PDF files of journal papers? Thanks again, Jakob.scholbach 02:40, 19 June 2007 (UTC)

I feel it only fair to warn you in advance that this project may require more effort than you anticipated going in! :-)

Here are some statistics showing Firefox now accounts for over 1/3 of browser use. The button may feel warm and fuzzy for IE users, but cold and prickly for everyone else. So far as I can tell, cross-platform JavaScript provides no access to the clipboard (possibly for security reasons). However, this feature of standard HTML may be of interest; a two-key sequence, TAB and copy, is still friendly. One fine point: formally, you should tell a browser what script language to assume for the page.
Something has changed since I first looked, and most of the accented characters look OK now. However, I see the document being served with character encoding ISO-8859-1, not UTF-8. (In Firefox, select "Page Info" under the "Tools" menu.) There can be as many as four (five?) pieces to this puzzle.
- The server's own ideas, based on whatever it likes.
- An HTML tag,
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
- A first line with an XML incantation (but see this guide),
  <?xml version="1.0" encoding="utf-8" ?>
- The actual text encoding of the page, as saved by a text editor or a generator script.
- (?) What the browser chooses to do, properly or improperly. For example, Firefox gives the user a "Character Encoding" option under the "View" menu.
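To see which of these pieces disagree, one could compare the declared encodings with the actual bytes. The following is only a rough sketch in Python; the function names `declared_charsets` and `is_valid_utf8` are made up for illustration, and the regular expressions are a simplification, not a real HTML parser.

```python
import re

def declared_charsets(content_type_header: str, body: bytes) -> dict:
    """Collect the charset named by the HTTP header, the <meta> tag,
    and the XML declaration (hypothetical helper, regex-based sketch)."""
    found = {}
    match = re.search(r"charset=([\w-]+)", content_type_header, re.I)
    if match:
        found["http_header"] = match.group(1).lower()
    # Declarations are ASCII, so a lossy ASCII decode of the start is enough.
    head = body[:2048].decode("ascii", errors="replace")
    match = re.search(r"<meta[^>]+charset=[\"']?([\w-]+)", head, re.I)
    if match:
        found["meta_tag"] = match.group(1).lower()
    match = re.search(r"<\?xml[^>]*encoding=[\"']([\w-]+)", head, re.I)
    if match:
        found["xml_declaration"] = match.group(1).lower()
    return found

def is_valid_utf8(body: bytes) -> bool:
    """Check whether the bytes as actually saved decode as UTF-8."""
    try:
        body.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

page = (
    '<?xml version="1.0" encoding="utf-8" ?>\n'
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    '</head><body>Séminaire</body></html>'
)
report = declared_charsets("text/html; charset=ISO-8859-1", page.encode("utf-8"))
print(report)
print(is_valid_utf8(page.encode("utf-8")))
```

In this example the server header says ISO-8859-1 while the page itself claims, and really is, UTF-8, which is the kind of mismatch described above.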
Template options are a moving target; we have a choice of templates, and the fields of each template evolve over time. The best protection is to separate the database content from the generated citation. The intermediary is a declaration of how to fill in the template parameters from the database fields.
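As a rough illustration of that separation, here is a minimal Python sketch. The database field names (`author_family`, `author_given`, and so on), the mapping `CITATION_FIELD_MAP`, and the function `render_template` are all hypothetical; only the template parameter names `last`, `first`, `title` and `year` are meant to match {{citation}}, and the record holds placeholder values.

```python
# Declaration: template parameter -> database field (field names are assumptions).
CITATION_FIELD_MAP = {
    "last": "author_family",
    "first": "author_given",
    "title": "title",
    "year": "year",
    "publisher": "publisher",
    "isbn": "isbn",
}

def render_template(template_name: str, field_map: dict, record: dict) -> str:
    """Fill a wiki template from a database record, skipping empty fields."""
    parts = [template_name]
    for param, db_field in field_map.items():
        value = record.get(db_field)
        if value:
            parts.append(f"{param}={value}")
    return "{{" + " | ".join(parts) + "}}"

# Placeholder record, not a real reference.
record = {
    "author_family": "Family",
    "author_given": "Given",
    "title": "Example Title",
    "year": "2007",
}
print(render_template("citation", CITATION_FIELD_MAP, record))
# prints {{citation | last=Family | first=Given | title=Example Title | year=2007}}
```

If a template's fields change, or a different template is wanted, only the mapping needs to be edited; the stored records stay untouched.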
I do have some ideas about quality control. The vital supports needed are "immutability"/"versioning", and named authority. If I tag a database entry as "verified", that must only apply to the contents at the time of the tagging, not some later edited version. As well, tags must be attributed to users, implying that an entry can have multiple "verified" tags.
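One possible way to get this behaviour, sketched in Python, is to attach each tag to a digest of the entry's contents rather than to the entry itself. Using a SHA-256 hash is my own choice for the sketch, not something the database already does, and the names `content_digest`, `VerifiedTag`, `tag_verified` and `current_verifiers` are invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass

def content_digest(entry: dict) -> str:
    """Digest of an entry's contents; any edit produces a different digest."""
    canonical = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class VerifiedTag:
    user: str    # who vouches for the entry
    digest: str  # which exact contents they vouched for

def tag_verified(entry: dict, user: str) -> VerifiedTag:
    """Record that `user` verified the entry as it stands right now."""
    return VerifiedTag(user=user, digest=content_digest(entry))

def current_verifiers(entry: dict, tags: list) -> list:
    """Users whose 'verified' tag still applies to the current contents."""
    digest = content_digest(entry)
    return [tag.user for tag in tags if tag.digest == digest]

# Placeholder entry and user names.
entry = {"title": "Example Title", "year": "2007"}
tags = [tag_verified(entry, "UserA"), tag_verified(entry, "UserB")]
print(current_verifiers(entry, tags))   # both tags apply

entry["year"] = "2008"                  # a later edit
print(current_verifiers(entry, tags))   # no tag applies to the edited version
```

A real system would also keep the old versions around, so that the verified contents can still be retrieved after an edit.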
Great.
For ISBNs, first the trivial parts: Given an ISBN of either 10 or 13 digits, a simple digit calculation can confirm its checksum validity. Given a valid 10-digit ISBN, generating a correct 13-digit replacement is just as easy. Hyphenation is more awkward, but doable. (I strongly urge you to talk to Rich Farmbrough!) The vast majority of numbers will have a group identifier code of "0", meaning "English speaking area". Publisher numbers are partitioned into ranges, found here, for instance. For group 0 these are unlikely to change, so most of the hyphenations should be easy. Each group uses different partitions; here is the current table for group 0.
Group 0
Valid Publisher Numbers
00 – 19
200 – 699
7000 – 8499
85000 – 89999
900000 – 949999
9500000 – 9999999
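To make the three steps concrete, here is a small Python sketch of checksum validation, ISBN-10 to ISBN-13 conversion, and hyphenation for group 0 using the table above. It is not Rich Farmbrough's code; the function names are my own, the input is assumed to be stripped of hyphens and spaces already, and the ranges are simply the ones quoted here, so they should be rechecked against the current official table.

```python
# Group 0 publisher ranges, copied from the table above.
GROUP0_RANGES = [
    ("00", "19"), ("200", "699"), ("7000", "8499"),
    ("85000", "89999"), ("900000", "949999"), ("9500000", "9999999"),
]

def isbn10_is_valid(digits: str) -> bool:
    """Weights 10..1; the weighted sum must be divisible by 11 ('X' = 10)."""
    if len(digits) != 10 or not (digits[:9].isascii() and digits[:9].isdigit()):
        return False
    if digits[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c))
                for i, c in enumerate(digits))
    return total % 11 == 0

def isbn13_check_digit(first12: str) -> str:
    """Weights alternate 1, 3; the check digit brings the sum to a multiple of 10."""
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(first12))
    return str((10 - total % 10) % 10)

def isbn13_is_valid(digits: str) -> bool:
    return (len(digits) == 13 and digits.isascii() and digits.isdigit()
            and isbn13_check_digit(digits[:12]) == digits[12])

def isbn10_to_isbn13(digits: str) -> str:
    """Prefix 978, drop the old check digit, compute the new one."""
    if not isbn10_is_valid(digits):
        raise ValueError("invalid ISBN-10 checksum")
    core = "978" + digits[:9]
    return core + isbn13_check_digit(core)

def hyphenate_group0(isbn13: str) -> str:
    """Hyphenate a 978-0 ISBN-13 by finding the publisher range it falls in."""
    if not isbn13_is_valid(isbn13) or not isbn13.startswith("9780"):
        raise ValueError("expected a valid ISBN-13 in group 978-0")
    body = isbn13[4:12]  # publisher number followed by title number
    for low, high in GROUP0_RANGES:
        width = len(low)
        if low <= body[:width] <= high:
            return f"978-0-{body[:width]}-{body[width:]}-{isbn13[12]}"
    raise ValueError("publisher number not covered by the table")

isbn13 = isbn10_to_isbn13("0306406152")
print(isbn13)                    # 9780306406157
print(hyphenate_group0(isbn13))  # 978-0-306-40615-7
```

Other groups would need their own range tables, and a full solution would first have to detect the group identifier itself.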

Hope this helps! --KSmrq^T 23:59, 19 June 2007 (UTC)

The code I used for hyphenating and checksumming 10 digit ISBNs is under WP:AWB on a scripts page. I can't remember if it deals with 13 digit ones, or if that was "lost". I'm working on resurrecting my dead machines today, so may be able to help more when this is done. Rich Farmbrough, 12:10 20 June 2007 (GMT).
